Histopathological Cancer Detection with Deep Neural Networks

14 Apr 2019


Being able to automate the detection of metastasised cancer in pathological scans with machine learning and deep neural networks is an area of medical imaging and diagnostics with promising potential for clinical usefulness.

Here we explore a particular dataset prepared for this type of of analysis and diagnostics - The PatchCamelyon Dataset (PCam).

PCam is a binary classification image dataset containing approximately 300,000 labeled low-resolution images of lymph node sections extracted from digital histopathological scans. Each image is labelled by trained pathologists for the presence of metastasised cancer.

The goal of this work is to train a convolutional neural network on the PCam dataset and achieve close to, or near state-of-the-art results.

As we'll see, with the Fastai library, we achieve 98.6% accuracy in predicting cancer in the PCam dataset.

We approach this by preparing and training a neural network with the following features:

  1. Transfer learning with a convolutional neural net (Resnet50) as our backbone.
  2. The following data augmentations:
    • Image resizing
    • Random cropping
    • Horizontal and vertical axis image flipping
  3. Fit one cycle method to optimise learning rate selection for our training.
  4. Discriminative learning rates to fine-tune.

In addition we apply the following out of the box optimisations throughout our training:

  1. Dropout.
  2. Batch normalisation.
  3. Maxpooling.
  4. ReLU activations.

This notebook presents research and an analysis of this dataset using Fastai + Pytorch and is provided as a reference, tutorial, and open source resource for others to refer to. It is not intended to be a production ready resource for serious clinical application. We work here instead with low resolution versions of the original high-res clincal scans in the Camelyon16 dataset for education and research. This proves useful ground to prototype and test the effectiveness of various deep learning algorithms.

The Data

Examples above of a metastatic region (from Camelyon16)

Original Source: Camelyon16

PCam is actually a subset of the Camelyon16 dataset; a set of high resolution whole-slide images (WSI) of lymph node sections. This dataset is made available by the Diagnostic Image Analysis Group (DIAG) and Department of Pathology of the Radboud University Medical Center (Radboudumc) in Nijmegen, The Netherlands. The following is an excerpt from their website: https://camelyon16.grand-challenge.org/Data/

The data in this challenge contains a total of 400 whole-slide images (WSIs) of sentinel lymph node from two independent datasets collected in Radboud University Medical Center (Nijmegen, the Netherlands), and the University Medical Center Utrecht (Utrecht, the Netherlands).

The first training dataset consists of 170 WSIs of lymph node (100 Normal and 70 containing metastases) and the second 100 WSIs (including 60 normal slides and 40 slides containing metastases).

The test dataset consists of 130 WSIs which are collected from both Universities.

PatchCam (Kaggle)

PCam was prepared by Bas Veeling, a Phd student in machine learning for health from the Netherlands, specifically to help machine learning practitioners interested in working on this particular problem. It consists of 327,680, 96x96 colour images. An excellent overview of the dataset can be found here: http://basveeling.nl/posts/pcam/, and also available via download on github where there is further information on the data: https://github.com/basveeling/pcam

This particular dataset is downloaded directly from Kaggle through the Kaggle API, and is a version of the original PCam (PatchCamelyon) datasets but with duplicates removed.

PCam is intended to be a good dataset to perform fundamental machine learning analysis. As the name suggests, it's a smaller version of the significanlty larger Camelyon16 dataset used to perform similar analysis (https://camelyon16.grand-challenge.org/Data/)

From the author's words:

PCam packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and whole-slide image diagnosis. Furthermore, the balance between task-difficulty and tractability makes it a prime suspect for fundamental machine learning research on topics as active learning, model uncertainty, and explainability.


Hardware

We perform our training on an Ubuntu 18 machine with single RTX 2070 GPU using 16bit precision.

Fast AI imports

In [1]:
from fastai.vision import *
from fastai.metrics import error_rate

# Import Libraries here
import os
import json 
import shutil
import zipfile

%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Set paths that you'll use in this notebook
root_dir = "/home/adeperio/"

# notebook project directories
base_dir = root_dir + 'ml/pcam/'
!mkdir -p "{base_dir}"
In [2]:
# set the random seed
np.random.seed(2)

Kaggle SDK/API and downloading dataset

The data we are using lives on Kaggle. We use Kaggle's SDK to download the dataset directly from there. To work with the Kaggle SDK and API you will need to create a Kaggle API token in your Kaggle account.

When logged into Kaggle, navigate to "My Account" then scroll down to where you can see "Create New API Token". This will download a JSON file to your computer with your username and token string. Copy these contents to you ~/.kaggle/kaggle.json token file.

In [ ]:
# First we make sure we have the kaggle sdk installed.
# We assume here that on your machine you have the kaggle.json token file in ~/.kaggle/
!pip install kaggle

Kaggle API: Download Competition Data

In [ ]:
# Can list available Kaggle competitions here if needed
# !kaggle competitions list

# Download the histopathological data
!kaggle competitions download -c histopathologic-cancer-detection -p "{base_dir}"
In [ ]:
#now unzip the training files
!mkdir -p "{base_dir}train/"
dest_dir_train = Path(base_dir + 'train/')
print(base_dir + 'train.zip')
train_zip = zipfile.ZipFile(base_dir + 'train.zip', 'r')
train_zip.extractall(dest_dir_train)
train_zip.close()
In [ ]:
#now unzip the test files
!mkdir -p "{base_dir}test/"
dest_dir_test = Path(base_dir + 'test/')
test_zip = zipfile.ZipFile(base_dir + 'test.zip', 'r')
test_zip.extractall(dest_dir_test)
test_zip.close()
  
In [ ]:
# then extract the labels 
dest_dir_csv = Path(base_dir)
labels_csv_zip = zipfile.ZipFile(base_dir + 'train_labels.csv.zip', 'r')
labels_csv_zip.extractall(dest_dir_csv)
labels_csv_zip.close()
In [4]:
# Check the download here
path = Path(base_dir)
path.ls()
Out[4]:
[PosixPath('/home/adeperio/ml/pcam/test'),
 PosixPath('/home/adeperio/ml/pcam/.ipynb_checkpoints'),
 PosixPath('/home/adeperio/ml/pcam/pcam-ml-ubuntu.ipynb'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-50-unfrozen.pth'),
 PosixPath('/home/adeperio/ml/pcam/stage-1.pth'),
 PosixPath('/home/adeperio/ml/pcam/train_labels.csv'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-50.pth'),
 PosixPath('/home/adeperio/ml/pcam/test.txt'),
 PosixPath('/home/adeperio/ml/pcam/models'),
 PosixPath('/home/adeperio/ml/pcam/train'),
 PosixPath('/home/adeperio/ml/pcam/sample_submission.csv.zip'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-34.pth'),
 PosixPath('/home/adeperio/ml/pcam/train_labels.csv.zip'),
 PosixPath('/home/adeperio/ml/pcam/train.zip'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-50-1.pth'),
 PosixPath('/home/adeperio/ml/pcam/test.zip')]

Data Preparation

With our data now downloaded, we create an ImageDataBunch object to help us load the data into our model, set data augmentations, and split our data into train and test sets.

In [3]:
tfms = get_transforms(do_flip=True)
In [4]:
bs=64 # also the default batch size
data = ImageDataBunch.from_csv(
    base_dir, 
    ds_tfms=tfms, 
    size=224, 
    suffix=".tif",
    folder="train", 
    test="test",
    csv_labels="train_labels.csv", 
    bs=bs)

ImageDataBunch wraps up a lot of functionality to help us prepare our data into a format that we can work with when we train it. Let's go through some of the key functions it performs below:

Data Augmentation

By default ImageDatabunch performs a number of modifications and augmentations to the dataset:

  1. Centre crop the images
  2. There's also some randomness introduced on where and how it crops for the purposes of data augmentation
  3. It's important that all the images need to be of the same size for the model to be able to train on.

Image Flipping

There are various other data augmentations we could also use. But one of the key ones that we activate is image flipping on the vertical.

For pathology scans this is a reasonable data augmentation to activate, as there is little importance on whether the scan is oriented on the vertical axis or horizontal axis,

By default fastai will flip on the horizontal, but we need to turn on flipping on the vertical.

Batch Size

We'll be using the 1cycle policy (fit_one_cycle()) to train our network (more on this later). This is a hyper parameter optimisation that allows us to use higher learning rates.

Higher learning rates acts as a form of regularisation in 1cycle policy. Recall that a small batch size adds regularisation, so when using large batch sizes in 1cycle learning it allows for larger learning rates to be used.

The recommendation here is to use a batch size that is the largest our GPU supports when using 1cycle policy to train.

Training, validation and test sets

  1. We specify the folder location of the data (where the subfolders train and test exist along with the csv data)
  2. ImageDataBunch under the hood splits out the images (in the train sub-folder) into a training set and validation set (defaulting to an 80/20 percent split). There are 176,020 images in the training set and about 44,005 in the validation set.
  3. We also specify the location of the test sub-folder, that contains unlabelled images. Our learning model will measure accuracy and the error rates against this dataset
  4. The CSV file containing the data labels is also specified

Image size on base architecture and target architecture

Images in the target PCam dataset are square images 96x96. However, when bringing a pre-trained ImageNet model into our network, which was trained on larger images, we need to set the size accordingly to respect the image sizes in that dataset.

We choose 224 for size as a good default to start with.

Normalising the images

Once we have setup the ImageDataBunch object, we also normalise the images.

Normalising the images uses the mean and standard deviation of the images to transform the image values into a standardised distribution that is more efficient for a neural network to train on.

In [5]:
# now normalise the images
data.normalize(imagenet_stats)
Out[5]:
ImageDataBunch;

Train: LabelList (176020 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
1,0,0,1,1
Path: /home/adeperio/ml/pcam;

Valid: LabelList (44005 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
0,0,1,1,1
Path: /home/adeperio/ml/pcam;

Test: LabelList (57458 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: EmptyLabelList
,,,,
Path: /home/adeperio/ml/pcam

Below we take a look at some random samples of the data so we can get some understanding of what we are feeding into our network. This is a binary classification problem so there's only two classes (0 - 1)

In [8]:
data.show_batch(rows=4, figsize=(10, 10))

Learner (CNN Resnet50)

Once we have a correctly setup the ImageDataBunch object, we can now pass this, along with a pre-trained ImageNet model, to a cnn_learner. We will be using Resnet50 as our backbone.

In [6]:
# I am using a GPU that supports 16bit precision, so we also switch this support mode to on. 
# If your GPU does not have that support you can omit the to_fp16() call
learn = cnn_learner(data, models.resnet50, metrics=error_rate, callback_fns=ShowGraph).to_fp16()

Fastai wraps up a lot of state-of-the-art computer vision learning in its cnn_learner:

  • Connects our pre-trained model with a layer group of fully connected layers.
  • ReLU activations
  • Batch normalisation
  • Max pooling
  • Drop out

Importantly, we also specify a backbone network, that has been pre-trained on the ImageNet dataset, so that we can use transfer learning in our training.

Transfer learning

Starting with a backbone network from a well-performing model that was already pre-trained on another dataset is a method called transfer learning.

Transfer learning works on the premise that instead of training your data from scratch, you can use the learning (ie the learned weights) from another machine learning model as a starting point.

This is an incredibly effective method of training, and underpins current state-of-the-art practices in training deep neural networks.

When using pre-trained models we leverage, in particular, the learned features that are most in common with both the pre-trained model and the target dataset (PCam).

So for example, for models pre-trained on ImageNet such as Resnet50, training will leverage the common features (for example such as lines, geometry, patterns) that have already been learnt from the base dataset (in particular in the first few layers) to train on the target dataset.

For our model, we'll be using Resnet50. Resnet50 is a residual neural net trained on ImageNet data using 50 layers, and will provide a good starting point for our network.

Using learn.model we can see the various layers and elements of the neural network that are used.

In [7]:
learn.model
Out[7]:
Sequential(
  (0): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
        (downsample): Sequential(
          (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): Bottleneck(
        (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (2): Bottleneck(
        (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
    )
    (5): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
        (downsample): Sequential(
          (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): Bottleneck(
        (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (2): Bottleneck(
        (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (3): Bottleneck(
        (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
    )
    (6): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
        (downsample): Sequential(
          (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): Bottleneck(
        (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (2): Bottleneck(
        (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (3): Bottleneck(
        (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (4): Bottleneck(
        (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (5): Bottleneck(
        (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
    )
    (7): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
        (downsample): Sequential(
          (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): Bottleneck(
        (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
      (2): Bottleneck(
        (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace)
      )
    )
  )
  (1): Sequential(
    (0): AdaptiveConcatPool2d(
      (ap): AdaptiveAvgPool2d(output_size=1)
      (mp): AdaptiveMaxPool2d(output_size=1)
    )
    (1): Flatten()
    (2): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.25)
    (4): Linear(in_features=4096, out_features=512, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.5)
    (8): Linear(in_features=512, out_features=2, bias=True)
  )
)

Training and fit one cycle

Fit one cycle

We will be training our network with a method called fit one cycle. This optimisation is a way of applying a variable learning rate across the total number of epochs in our training run for a particular layer group. This has proven to be an extremely effective way to tune the learning rate hyperparamter for training.

Fit one cycle varies the learning rate from a minimum value at the first epoch (by default lr_max/div_factor), up to a pre-determined maximum value (lr_max), before descending again to a minimum across the remaining epochs. This min-max-min learning rate variance is called a cycle.

An excellent overview can be found here in the fastai docs https://docs.fast.ai/callbacks.one_cycle.html along with a more detailed explanation in the original paper by Leslie Smith [7], where this method of hyperparameter tuning was proposed.

So how then do we determine the most suitable maximum learning rate to enable fit one cycle? We run fastai's lr_find() method.

Running lr_find before unfreezing the network yields the graph below. We want to choose a learning rate just before the loss starts to exponetially increase.

In [8]:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.

From a visual observation of the resulting learning rate plot, starting with a learning rate of 1e-02 seems to be a reasonable choice for an initial lr value.

Freeze

By default we start with our network frozen. This means that the layers of our pre-trained Resnet50 model have trainable=False applied, and training begins only on the target dataset. The learning rate we provide to fit_one_cycle() applies only to that layer group for this initial training run.

In [9]:
learn.fit_one_cycle(30, slice(1e-02))
Total time: 5:12:35

epoch train_loss valid_loss error_rate time
0 0.239715 0.193532 0.075151 10:23
1 0.205948 0.153948 0.057380 10:23
2 0.174833 0.126729 0.044495 10:26
3 0.170107 0.119771 0.041268 10:26
4 0.143925 0.113238 0.038427 10:26
5 0.149095 0.126973 0.042063 10:26
6 0.153990 0.114695 0.037928 10:26
7 0.136065 0.111566 0.036132 10:25
8 0.133533 0.160982 0.034564 10:26
9 0.139547 0.135928 0.039155 10:25
10 0.134291 0.100237 0.034587 10:25
11 0.120010 0.112066 0.034314 10:27
12 0.134262 0.092657 0.032065 10:26
13 0.117560 0.121253 0.039382 10:22
14 0.107682 0.108003 0.037132 10:22
15 0.113562 0.111619 0.039586 10:22
16 0.107812 0.094989 0.033269 10:22
17 0.106355 0.082264 0.027792 10:21
18 0.103998 0.082920 0.027406 10:21
19 0.100962 0.082800 0.026838 10:22
20 0.099170 0.075794 0.025020 10:23
21 0.092189 0.074788 0.024497 10:23
22 0.090074 0.073442 0.024497 10:23
23 0.089671 0.072417 0.024520 10:24
24 0.082923 0.069527 0.022975 10:24
25 0.068498 0.069962 0.022657 10:24
26 0.072563 0.070755 0.023656 10:24
27 0.073710 0.068305 0.022497 10:24
28 0.077948 0.069000 0.022634 10:23
29 0.073082 0.067487 0.022407 10:23
In [10]:
learn.save(base_dir + 'pcamv4-stage-1')
In [8]:
preds,y, loss = learn.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The accuracy is {0} %.'.format(acc*100))
The accuracy is 97.75934600830078 %.

Analysing first results

Analysing the graph of the initial training run, we can see that the training loss and validation loss both steadily decrease and begin to converge while the training progresses.

With the validation loss steadily decreasing, there are no clear signs of significant overfitting or underfitting.

Accuracy at the moment is 97.76%.

We can learn more about this training run by using Fastai's confusion matrix and plotting our top losses.

In [10]:
interp = ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

len(data.valid_ds)==len(losses)==len(idxs)
Out[10]:
True
In [16]:
interp.plot_confusion_matrix(figsize=(7,7), dpi=60)
In [14]:
interp.plot_top_losses(16, figsize=(16,16))

The confusion matrix is a handy tool to help us obtain more detail on the effectiveness of the training so far. Specifically, we get some clarity on the amount of false positives and false negatives predicted by our neural net.

Plotting our top losses allows us to examine specific images in more detail. Fastai generates a heatmap of images that we predicted incorrectly. The heatmap allows us to examine areas of images which confused our network. Its useful to do this so we obtain better context around how our model is behaving on each test run, and direct us to clues as to how to improve it.

Fine-tuning, unfreezing, and discriminative learning rates

Initial results are already good on the first training run. But with some more fine-tuning, we can actually do a little better.

Transfer learning + Fine-tuning = Better Generalisation

Transfer learning alone brings us much further than training our network from scratch. But this method is prone to optimisation difficulties present between fragile co-adpated layers when connecting a per-trained network. We counter this by fine-tuning our model; making the all layers of our network, including the pre-trained Resnet50 layers, to be trainable. When we unfreeze we train across all of our layers. (See [6])

This leads to better results and a better ability to generalise to new examples.

Discriminative Learning Rates and 1cycle

With all of our layers in our network unfrozen and open for training, we can now also make use of discriminative learning rates in conjunction with fit_one_cycle to improve our optimisations even further.

Discriminative learning rates lets us apply specific learning rates to layer groups in our network, optimising for each group. Fit one cycle then operates on these values and uses them to vary learning rates according to the 1cycle policy. (https://docs.fast.ai/basic_train.html#Discriminative-layer-training)

How do we find the best range of learning rates to use for fit 1cycle? We can use lr_find() to help us with that.

In [65]:
# Using the plot from the learning rate finder above, 
# find a value for the first value in the max_lr slice 
# that is well before the point when the loss start to significantly degrade
learn.unfreeze()

learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.

Analysing our lr plot above, we choose a range of learning rates just before the loss begins to radically increase and apply that as a slice to our fit_one_cycle method below.

From our plot above, it seems reasonable to select an upper bound rate of 1e-4, and as a recommended rule for our lower bound rate, we can select a value 10x smaller than our upperbound, in this case 1e-5.

The lower bound rate will apply to the layers in our pre-trained Resnet50 layer group. The weights here are already well learned so we can proceed with a slower learning rate for this group of layers.

The upper bound rate gets applied to the final layer group of fully connected layers previously trained in our last training run on the target dataset. The layers in this group will benefit from a faster learning rate.

In [66]:
learn.fit_one_cycle(30, max_lr=slice(1e-5,1e-4))
Total time: 6:41:31

epoch train_loss valid_loss error_rate time
0 0.068424 0.066296 0.021475 13:17
1 0.085035 0.068620 0.022861 13:19
2 0.077056 0.072795 0.024724 13:19
3 0.082423 0.076358 0.025315 13:22
4 0.094262 0.075467 0.025202 13:19
5 0.089292 0.095523 0.031724 13:21
6 0.082299 0.069068 0.022634 13:21
7 0.085847 0.074969 0.025293 13:20
8 0.067565 0.072602 0.023634 13:21
9 0.082363 0.066331 0.022520 13:21
10 0.061088 0.063667 0.021338 13:20
11 0.067841 0.069676 0.023975 13:20
12 0.052601 0.059232 0.018702 13:24
13 0.050393 0.061368 0.019702 13:22
14 0.054348 0.059377 0.019271 13:22
15 0.047280 0.060785 0.018407 13:22
16 0.040474 0.061224 0.018134 13:22
17 0.032307 0.055822 0.016180 13:21
18 0.035760 0.055463 0.015998 13:21
19 0.034614 0.055634 0.015930 13:22
20 0.027590 0.056590 0.016203 13:21
21 0.021472 0.055784 0.015407 13:23
22 0.021862 0.053663 0.014748 13:22
23 0.016913 0.056335 0.015044 13:24
24 0.015263 0.059568 0.014135 13:25
25 0.016801 0.056743 0.013839 13:27
26 0.016958 0.059268 0.013953 13:26
27 0.010636 0.057276 0.013703 13:27
28 0.010079 0.058188 0.014089 13:21
29 0.009765 0.058023 0.014044 13:21
In [67]:
learn.save(base_dir + 'pcamv4-stage-2')
In [17]:
preds,y, loss = learn.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The accuracy is {0} %.'.format(acc*100))
The accuracy is 98.59561157226562 %.
In [20]:
interp = ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

len(data.valid_ds)==len(losses)==len(idxs)
Out[20]:
True
In [21]:
interp.plot_top_losses(16, figsize=(16,16))

Final analysis

In the final fine-tuning training run, we can see that our training loss and validation loss begin to diverge from each other now mid training, and that the training loss is progressively improving at a much faster rate than validation loss, steadily decreasing until stabilising to a steady range of values in the final epochs of the run.

The model is probably just complex enough. Any further increases in our validation loss, in the presence of a continually decreasing training loss, would result in overfitting, failing to generalise well to new examples. Training for too long would risk this.

Finalising the at this point in our training yields a fine-tuned accuracy of 98.6% over our stage 1 training run result.

Conclusion

With an approach using deep convolutional neural networks, transfer learning, and fit one cycle optimisations, we can achieve 98.6% accuracy on detecting cancer in the PCam dataset.

References

[1] Practical Deep Learning for Coders, v3. Fastai. Jeremy Howard. Rachel Thomson. https://course.fast.ai/index.html

[2] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. "Rotation Equivariant CNNs for Digital Pathology". arXiv:1806.03962

[3] Ehteshami Bejnordi et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA: The Journal of the American Medical Association, 318(22), 2199–2210. doi:jama.2017.14585

[4] Camelyon16 Challenge https://camelyon16.grand-challenge.org

[5] Kaggle. Histopathologic Cancer Detection - Identify metastatic tissue in histopathologic scans of lymph node sections https://www.kaggle.com/c/histopathologic-cancer-detection

[6] Jason Yosinski. Jeff Clune. Yoshua Bengio. Hod Lipson. "How transferable are features in deep neural networks? ". arXiv:1411.1792v1 [cs.LG]

[7] Leslie N. Smith. "A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay". arXiv:1803.09820v2 [cs.LG]

In [ ]: