Histopathological Cancer Detection with Deep Neural Networks

14 Apr 2019


Automating the detection of metastasised cancer in pathology scans with machine learning and deep neural networks is an area of medical imaging and diagnostics with promising potential for clinical use.

Here we explore a particular dataset prepared for this type of analysis and diagnostics - The PatchCamelyon Dataset (PCam).

PCam is a binary classification image dataset containing approximately 300,000 labelled low-resolution images of lymph node sections extracted from digital histopathological scans. Each image is labelled by trained pathologists for the presence of metastasised cancer.

The goal of this work is to train a convolutional neural network on the PCam dataset and achieve results at or near the state of the art.

As we'll see, with the Fastai library, we achieve 98.6% accuracy in predicting cancer in the PCam dataset.

We approach this by preparing and training a neural network with the following features (a code sketch follows the list):

  1. Transfer learning with a convolutional neural net (Resnet50) as our backbone.
  2. The following data augmentations:
    • Image resizing
    • Random cropping
    • Horizontal and vertical axis image flipping
  3. Fit one cycle method to optimise learning rate selection for our training.
  4. Discriminative learning rates to fine-tune.
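
Putting these together, the following is a minimal fastai v1 sketch of the recipe above (the actual training runs appear later in the notebook; the epoch counts and learning rates here are illustrative placeholders):

learn = cnn_learner(data, models.resnet50, metrics=error_rate)  # transfer learning with a ResNet-50 backbone
learn.fit_one_cycle(4)                                          # fit one cycle with the backbone frozen
learn.unfreeze()                                                # unfreeze the backbone for fine-tuning
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))                # discriminative learning rates across layer groups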

In addition we apply the following out-of-the-box optimisations throughout our training (a minimal sketch of how they combine follows the list):

  1. Dropout.
  2. Batch normalisation.
  3. Maxpooling.
  4. ReLU activations.
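
These all come built into the ResNet-50 backbone and the head fastai attaches to it. Purely for illustration (this is not the actual ResNet-50 layer layout), a minimal PyTorch block combining the four components might look like:

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # convolution
    nn.BatchNorm2d(64),                          # batch normalisation
    nn.ReLU(inplace=True),                       # ReLU activation
    nn.MaxPool2d(2),                             # max pooling
    nn.Dropout(0.25),                            # dropout
)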

This notebook presents research and an analysis of this dataset using Fastai + PyTorch, and is provided as a reference, tutorial, and open source resource for others to refer to. It is not intended to be a production-ready resource for serious clinical application. We work here instead with low-resolution versions of the original high-resolution clinical scans in the Camelyon16 dataset, for education and research. This proves useful ground for prototyping and testing the effectiveness of various deep learning algorithms.

The Data

(Figure: examples of a metastatic region, from Camelyon16)

Original Source: Camelyon16

PCam is a subset of the Camelyon16 dataset, a set of high-resolution whole-slide images (WSI) of lymph node sections. This dataset is made available by the Diagnostic Image Analysis Group (DIAG) and Department of Pathology of the Radboud University Medical Center (Radboudumc) in Nijmegen, The Netherlands. The following is an excerpt from their website: https://camelyon16.grand-challenge.org/Data/

The data in this challenge contains a total of 400 whole-slide images (WSIs) of sentinel lymph node from two independent datasets collected in Radboud University Medical Center (Nijmegen, the Netherlands), and the University Medical Center Utrecht (Utrecht, the Netherlands).

The first training dataset consists of 170 WSIs of lymph node (100 Normal and 70 containing metastases) and the second 100 WSIs (including 60 normal slides and 40 slides containing metastases).

The test dataset consists of 130 WSIs which are collected from both Universities.

PatchCam (Kaggle)

PCam was prepared by Bas Veeling, a PhD student in machine learning for health from the Netherlands, specifically to help machine learning practitioners interested in working on this particular problem. It consists of 327,680 96x96 colour images. An excellent overview of the dataset can be found here: http://basveeling.nl/posts/pcam/, and it is also available for download on GitHub, where there is further information on the data: https://github.com/basveeling/pcam

This particular dataset is downloaded directly from Kaggle through the Kaggle API, and is a version of the original PCam (PatchCamelyon) dataset with duplicates removed.

PCam is intended to be a good dataset for fundamental machine learning analysis. As the name suggests, it's a smaller version of the significantly larger Camelyon16 dataset used to perform similar analysis (https://camelyon16.grand-challenge.org/Data/).

In the author's words:

PCam packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and whole-slide image diagnosis. Furthermore, the balance between task-difficulty and tractability makes it a prime suspect for fundamental machine learning research on topics as active learning, model uncertainty, and explainability.


Hardware

We perform our training on an Ubuntu 18 machine with a single RTX 2070 GPU, using 16-bit precision.
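
In fastai v1, mixed precision is enabled by converting the learner with to_fp16(). A minimal sketch (the learner construction here is illustrative):

learn = cnn_learner(data, models.resnet50, metrics=error_rate).to_fp16()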

Fastai imports

In [1]:
from fastai.vision import *
from fastai.metrics import error_rate

# Import Libraries here
import os
import json 
import shutil
import zipfile

%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Set paths that you'll use in this notebook
root_dir = "/home/adeperio/"

# notebook project directories
base_dir = root_dir + 'ml/pcam/'
!mkdir -p "{base_dir}"
In [2]:
# set the random seed
np.random.seed(2)

Kaggle SDK/API and downloading dataset

The data we are using lives on Kaggle. We use Kaggle's SDK to download the dataset directly from there. To work with the Kaggle SDK and API you will need to create a Kaggle API token in your Kaggle account.

When logged into Kaggle, navigate to "My Account", then scroll down to "Create New API Token". This will download a JSON file containing your username and token string. Copy its contents to your ~/.kaggle/kaggle.json token file.
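
The token file is a small JSON document of the following shape (placeholder values shown):

{"username": "your-kaggle-username", "key": "your-api-token"}

The Kaggle CLI also expects the file to be readable only by you, which you can ensure with chmod 600 ~/.kaggle/kaggle.json.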

In [ ]:
# First we make sure we have the kaggle sdk installed.
# We assume here that on your machine you have the kaggle.json token file in ~/.kaggle/
!pip install kaggle

Kaggle API: Download Competition Data

In [ ]:
# Can list available Kaggle competitions here if needed
# !kaggle competitions list

# Download the histopathological data
!kaggle competitions download -c histopathologic-cancer-detection -p "{base_dir}"
In [ ]:
# now unzip the training files
!mkdir -p "{base_dir}train/"
dest_dir_train = Path(base_dir + 'train/')
print(base_dir + 'train.zip')
train_zip = zipfile.ZipFile(base_dir + 'train.zip', 'r')
train_zip.extractall(dest_dir_train)
train_zip.close()
In [ ]:
# now unzip the test files
!mkdir -p "{base_dir}test/"
dest_dir_test = Path(base_dir + 'test/')
test_zip = zipfile.ZipFile(base_dir + 'test.zip', 'r')
test_zip.extractall(dest_dir_test)
test_zip.close()
  
In [ ]:
# then extract the labels 
dest_dir_csv = Path(base_dir)
labels_csv_zip = zipfile.ZipFile(base_dir + 'train_labels.csv.zip', 'r')
labels_csv_zip.extractall(dest_dir_csv)
labels_csv_zip.close()
In [4]:
# Check the download here
path = Path(base_dir)
path.ls()
Out[4]:
[PosixPath('/home/adeperio/ml/pcam/test'),
 PosixPath('/home/adeperio/ml/pcam/.ipynb_checkpoints'),
 PosixPath('/home/adeperio/ml/pcam/pcam-ml-ubuntu.ipynb'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-50-unfrozen.pth'),
 PosixPath('/home/adeperio/ml/pcam/stage-1.pth'),
 PosixPath('/home/adeperio/ml/pcam/train_labels.csv'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-50.pth'),
 PosixPath('/home/adeperio/ml/pcam/test.txt'),
 PosixPath('/home/adeperio/ml/pcam/models'),
 PosixPath('/home/adeperio/ml/pcam/train'),
 PosixPath('/home/adeperio/ml/pcam/sample_submission.csv.zip'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-34.pth'),
 PosixPath('/home/adeperio/ml/pcam/train_labels.csv.zip'),
 PosixPath('/home/adeperio/ml/pcam/train.zip'),
 PosixPath('/home/adeperio/ml/pcam/stage-1-50-1.pth'),
 PosixPath('/home/adeperio/ml/pcam/test.zip')]

Data Preparation

With our data now downloaded, we create an ImageDataBunch object to help us load the data into our model, set data augmentations, and split our data into training and validation sets.

In [3]:
tfms = get_transforms(do_flip=True, flip_vert=True)  # enable vertical as well as horizontal flips (see Image Flipping below)
In [4]:
bs=64 # also the default batch size
data = ImageDataBunch.from_csv(
    base_dir, 
    ds_tfms=tfms, 
    size=224, 
    suffix=".tif",
    folder="train", 
    test="test",
    csv_labels="train_labels.csv", 
    bs=bs)

ImageDataBunch wraps up a lot of functionality to help us prepare our data into a format that we can work with when we train it. Let's go through some of the key functions it performs below:

Data Augmentation

By default ImageDataBunch performs a number of modifications and augmentations to the dataset:

  1. Images are centre-cropped.
  2. Some randomness is introduced in where and how each image is cropped, for the purposes of data augmentation.
  3. All images are resized to the same dimensions, since the model requires uniformly sized inputs to train.

Image Flipping

There are various other data augmentations we could also use, but a key one we activate is image flipping on the vertical axis.

For pathology scans this is a reasonable augmentation to activate, as tissue has no canonical orientation: a scan is equally valid flipped on the vertical or horizontal axis.

By default fastai flips only on the horizontal, so we turn vertical flipping on as well (the flip_vert=True argument in the get_transforms call above).

Batch Size

We'll be using the 1cycle policy (fit_one_cycle()) to train our network (more on this later). This is a hyperparameter schedule that allows us to train with higher learning rates.

In the 1cycle policy, the higher learning rates act as a form of regularisation. Recall that a small batch size also adds regularisation; with large batch sizes, 1cycle training lets the larger learning rates supply that regularisation instead.

The recommendation is therefore to use the largest batch size our GPU supports when training with the 1cycle policy.

Training, validation and test sets

  1. We specify the folder location of the data (where the train and test sub-folders exist, along with the CSV of labels).
  2. Under the hood, ImageDataBunch splits the images in the train sub-folder into a training set and a validation set, defaulting to an 80/20 split (adjustable via valid_pct - see the sketch after this list). This gives 176,020 images in the training set and 44,005 in the validation set.
  3. We also specify the location of the test sub-folder, which contains unlabelled images; these are used for inference (e.g. a Kaggle submission), while accuracy and error rate are measured on the validation set.
  4. The CSV file containing the data labels is also specified.
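
If you want the validation split to be explicit rather than relying on the default, ImageDataBunch.from_csv accepts a valid_pct argument (the call below is otherwise the same as the one above):

data = ImageDataBunch.from_csv(
    base_dir, 
    ds_tfms=tfms, 
    size=224, 
    suffix=".tif",
    folder="train", 
    test="test",
    csv_labels="train_labels.csv", 
    valid_pct=0.2,  # 80/20 train/validation split (also the fastai default)
    bs=bs)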

Image size on base architecture and target architecture

Images in the target PCam dataset are 96x96 squares. However, our backbone is pre-trained on ImageNet, where models are conventionally trained on larger images, so we set the input size to suit the pre-trained weights.

We choose size=224, the standard ImageNet input size, as a good default to start with.

Normalising the images

Once we have setup the ImageDataBunch object, we also normalise the images.

Normalising uses per-channel means and standard deviations to transform the image values into a standardised distribution that is more efficient for a neural network to train on. Since our backbone is pre-trained on ImageNet, we normalise with the ImageNet statistics (imagenet_stats).
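
As a minimal sketch of what this does under the hood (the mean and standard deviation values below are the ImageNet statistics that fastai stores in imagenet_stats):

import torch

imagenet_mean = torch.tensor([0.485, 0.456, 0.406])  # per-channel means (RGB)
imagenet_std = torch.tensor([0.229, 0.224, 0.225])   # per-channel standard deviations

def normalise(x):
    # x: image tensor of shape (3, H, W), values scaled to [0, 1]
    return (x - imagenet_mean[:, None, None]) / imagenet_std[:, None, None]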

In [5]:
# now normalise the images
data.normalize(imagenet_stats)
Out[5]:
ImageDataBunch;

Train: LabelList (176020 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
1,0,0,1,1
Path: /home/adeperio/ml/pcam;

Valid: LabelList (44005 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
0,0,1,1,1
Path: /home/adeperio/ml/pcam;

Test: LabelList (57458 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: EmptyLabelList
,,,,
Path: /home/adeperio/ml/pcam

Below we take a look at some random samples of the data so we can get some understanding of what we are feeding into our network. This is a binary classification problem, so there are only two classes (0 and 1).

In [8]:
data.show_batch(rows=4, figsize=(10, 10))