Use Case 7: Lymphoma Sub-Type Classification

This blog post explains how to train a deep learning lymphoma sub-type classifier in accordance with our paper “Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases”.

Please note that there has been an update to the overall tutorial pipeline, which is discussed in full here.

This text assumes that Caffe is already installed and running. For guidance on that, you can reference this blog post, which describes how to install it in an HPC environment (and can easily be adapted for local Linux distributions).


The NIA curated this dataset to address the need of identifying three sub-types of lymphoma: Chronic Lymphocytic Leukemia (CLL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL). Currently, class-specific probes are used in order to reliably distinguish the sub-types, but these come with additional cost and equipment overheads. Expert pathologists specializing in these types of lymphomas, on the other hand, have shown promise in being able to differentiate these sub-types on H&E, indicating that there is the potential for a DP approach to be employed. A successful approach would allow for more consistent and less demanding diagnosis of this disease. This dataset was created to mirror real-world situations and as such contains samples prepared by different pathologists at different sites. They have additionally selected samples which contain a larger degree of staining variation than one would normally expect.

This use case represents the only classification use case of this manuscript: attempting to separate images into 1 of 3 sub-types of lymphoma. In the previous tasks, we were looking at primitives and attempting to segment or detect them. In this case, though, a high level approach is taken, wherein we provide whole tissue samples to have the DL learn unique features of each class.


We break down this approach into 5 steps:

Step 1: Patch Extraction (Matlab): extract patches from all images separated into the 3 sub-types

Step 2: Cross-Validation Creation (Matlab): at the image level, split the patches into 5-fold training and testing sets

Step 3: Database Creation (Bash): using the patches and training lists created in the previous steps, create 5 sets of leveldb training and testing databases, with mean files, for high performance DL training.

Step 4: Training of DL classifier (Bash): Slightly alter the 2 prototxt files used by Caffe (the solver and the architecture) to point to the correct file locations. Use these to train the classifier.

Step 5: Generating Output on Test Images (Python): Use the final model to generate the output

There are, of course, other ways of implementing a pipeline like this (e.g., use Matlab to directly create a leveldb, or skip the leveldb entirely and use the images directly for training). I’ve found that the above pipeline fits most easily into the tools that are available inside of Caffe and Matlab, and thus requires the least maintenance and reduces complexity for less experienced users. If you have a suggested improvement, I’d love to hear it!

Dataset Description

The dataset consists of 374 images of size 1388 x 1040. These are further broken down into 113 for the CLL class, 139 for the FL class and 122 for the MCL class. Unfortunately, there is no description with the data indicating if the prefix of the file name indicates a unique patient or a unique facility. They did indicate that the data has been curated from multiple sources to create a real-world type cohort which contains typical stain and scanning variances.

Regardless, to create a valid comparison to wnd-chrm, which also used this dataset, we treat the images in the same way and assume that each image is from a unique patient.

The data is located here (1.4G).

Examples of these images (file names of the samples shown in the original figure):

CLL: CLL-sj-03-5521_005, CLL-sj-03-852-R2_009, CLL-sj-03-476_001, CLL-sj-05-5269-R10_012
FL: FL-sj-05-6124-R3_002, FL-sj-05-5311-R1_005, FL-sj-05-1881-R1_008, FL-sj-05-588-R1_007
MCL: MCL-sj-05-1374_011, MCL-sj-05-901-R1_006, MCL-sj-04-4967-R2_009, MCL-sj-05-4179-R1_012

Step 1: Patch Extraction (Matlab)

We refer to step1_make_patches.m, which is fully commented.

A high level understanding is provided here:

  1. Since each image is representative of the entire class (as opposed to at a pixel level), we don’t have (or need) any annotation masks. So we simply break each image up into patches and assign them to the appropriate class.
  2. Each patch is saved to disk. At the same time, we maintain a “class_struct”, which contains all of the file names which have been written to disk.

An example patch file name is CLL-sj-03-476_001_sub_1.png.

The file name consists of u, the image ID (here CLL-sj-03-476_001), which has the class as its prefix, followed by z, which indicates the patch number (here 1). We haven’t kept track of the rotations explicitly, although there are two (0 and 90 degrees): odd numbers are the 0 degree rotation (z=1,3,5,…) and even numbers are the 90 degree rotation (z=2,4,6,…).
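The naming convention can be decoded with a few lines of Python. This is only a minimal sketch and not part of step1_make_patches.m: the helper name parse_patch_name is hypothetical, and it assumes the "_sub_" separator seen in the example file name.

```python
import os


def parse_patch_name(filename):
    """Split a patch file name into (image_id, class_prefix, z, rotation).

    Hypothetical helper: assumes names of the form <image ID>_sub_<z>.<ext>.
    """
    stem, _ext = os.path.splitext(os.path.basename(filename))
    image_id, sep, z_text = stem.rpartition("_sub_")
    if not sep or not z_text.isdigit():
        raise ValueError("not a patch file name: %r" % filename)
    z = int(z_text)
    # Odd patch numbers are the 0 degree copy, even numbers the 90 degree copy.
    rotation = 0 if z % 2 == 1 else 90
    # The class is the part of the image ID before the first dash.
    class_prefix = image_id.split("-", 1)[0]
    return image_id, class_prefix, z, rotation
```

For instance, parse_patch_name("CLL-sj-03-476_001_sub_1.png") yields the image ID CLL-sj-03-476_001, the class prefix CLL, patch number 1 and a rotation of 0 degrees.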

Step 2: Cross-Validation Creation (Matlab)

Now that we have all of the patches written to disk, and we have all of the file names saved into patient_struct, we want to split them into a cross fold validation scheme. We use step2_make_training_lists.m for this, which is fully commented.

In this code, we use 5-fold validation. For each fold, we create 4 text files. Using fold 1 as an example:

train_w32_parent_1.txt: This contains a list of the patient IDs which have been used as part of this fold’s training set. This is similar to test_w32_parent_1.txt, which contains the patients used for the test set. An example of the file content is:


train_w32_1.txt: contains the filenames of the patches which should go into the training set (and test set when using test_w32_1.txt). The file format is [filename] [tab] [class], where class is 0, 1, 2 for CLL, FL and MCL, respectively. An example of the file content is:

CLL-sj-03-2810_001_sub_1.tif 0
CLL-sj-03-2810_001_sub_2.tif 0
CLL-sj-03-2810_001_sub_3.tif 0
CLL-sj-03-2810_001_sub_4.tif 0
CLL-sj-03-2810_001_sub_5.tif 0
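The important property of the split is that it happens at the image level, so patches from one image never end up in both the training and the test list of a fold. The Python sketch below illustrates that idea under stated assumptions; it is not a port of step2_make_training_lists.m. The function names are hypothetical, the round-robin assignment is just one simple choice, and no attempt is made to balance the classes across folds.

```python
import random
from collections import defaultdict

CLASS_INDEX = {"CLL": 0, "FL": 1, "MCL": 2}


def image_id_of(patch_name):
    """Image ID is everything before the trailing _sub_<z> part."""
    return patch_name.rsplit("_sub_", 1)[0]


def make_folds(patch_names, n_folds=5, seed=0):
    """Return one (train_lines, test_lines) pair per fold.

    Each line has the form "<filename>\t<class>".
    """
    by_image = defaultdict(list)
    for name in patch_names:
        by_image[image_id_of(name)].append(name)

    image_ids = sorted(by_image)
    random.Random(seed).shuffle(image_ids)

    folds = []
    for k in range(n_folds):
        # Every n_folds-th image (starting at k) is held out for testing.
        test_ids = set(image_ids[k::n_folds])
        train, test = [], []
        for image_id in image_ids:
            label = CLASS_INDEX[image_id.split("-", 1)[0]]
            target = test if image_id in test_ids else train
            target.extend("%s\t%d" % (name, label) for name in by_image[image_id])
        folds.append((train, test))
    return folds
```

Writing folds[0][0] to train_w32_1.txt and folds[0][1] to test_w32_1.txt, one line per entry, would give files in the format described above.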

All done with the Matlab component!

Step 3: Database Creation (Bash)

Now that we have both the patches saved to disk, and training and testing lists split into a 5-fold validation cohort, we need to get the data ready for consumption by Caffe. It is possible, at this point, to use an Image layer in Caffe and skip this step, but it comes with 2 caveats: (a) you need to make your own mean-file and ensure it is in the correct format, and (b) an image layer is not designed for high throughput. Also, having 100k+ files in a single directory can bring the system to its knees in many cases (for example, “ls”, “rm”, etc), so it’s a bit more handy to compress them all into 10 databases (1 training and 1 testing for each of the 5 folds), and use Caffe’s tool to compute the mean-file.

For this purpose, we use this bash file:

We run it in the “subs” directory (“./” in these commands), which contains all of the patches. As well, we assume the training lists are in “../”, the directory above it.

Here we’ll briefly discuss the general idea of the commands, while the script has additional functionality (computes everything in parallel for example).

Creating Databases

We use the caffe supplied convert_imageset tool to create the databases using this command:

~/caffe/build/tools/convert_imageset -shuffle -backend leveldb ./ ../train_w32_1.txt DB_train_1

We first tell it that we want to shuffle the lists, which is very important. Our lists are in patient and class order, making them unsuitable for stochastic gradient descent. Since the database stores files sequentially, as supplied, we need to permute the lists. Either we can do it manually (e.g., use sort --random-sort), or we can just let Caffe do it 🙂

We specify that we want to use a leveldb backend instead of a lmdb backend. My experiments have shown that leveldb can actually compress data much better without the consequence of a large amount of computational overhead, so we choose to use it.

Then we supply the directory with the patches, supply the training list, and tell it where to save the database. We do this similarly for the test set.
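If you would rather permute a list yourself than rely on the -shuffle flag, a short Python sketch along these lines would do it. The function name shuffle_list_file and the seed parameter are hypothetical, and this is an alternative to, not a description of, what Caffe does internally.

```python
import random


def shuffle_list_file(src_path, dst_path, seed=None):
    """Write a randomly permuted copy of a training list; return the line count."""
    with open(src_path) as src:
        # Drop blank lines and trailing newlines so the last entry cannot
        # get glued to its neighbour after shuffling.
        lines = [line.rstrip("\n") for line in src if line.strip()]
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w") as dst:
        for line in lines:
            dst.write(line + "\n")
    return len(lines)
```

The shuffled file can then be handed to convert_imageset without the -shuffle flag.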

Creating mean file

To zero the data, we compute a mean file, which is the mean value of a pixel as seen through all the patches of the training set. During training/testing time, this mean value is subtracted from the pixel to roughly “zero” the data, improving the efficiency of the DL algorithm.

Since we used a leveldb database to hold our patches, this is a straightforward process:

~/caffe/build/tools/compute_image_mean DB_train_1 DB_train_w32_1.binaryproto -backend leveldb

Supply it the name of the database to use, the mean filename to use as output and specify that we used a leveldb backend. That’s it!
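To repeat both commands for every fold, the argument lists can be generated programmatically. The Python sketch below only builds the commands and does not run them. It assumes that convert_imageset takes the image root folder, the list file and the database name as positional arguments, that Caffe lives in ~/caffe as in the commands above, and that the test databases are named DB_test_X (that name is an assumption, not taken from the script).

```python
import os

CAFFE_TOOLS = os.path.expanduser("~/caffe/build/tools")


def database_commands(fold, patch_dir="./", list_dir="../"):
    """Return the argument lists for one fold: train DB, test DB, mean file."""
    convert = os.path.join(CAFFE_TOOLS, "convert_imageset")
    mean = os.path.join(CAFFE_TOOLS, "compute_image_mean")
    train_db = "DB_train_%d" % fold
    test_db = "DB_test_%d" % fold  # assumed name for the test database
    train_list = os.path.join(list_dir, "train_w32_%d.txt" % fold)
    test_list = os.path.join(list_dir, "test_w32_%d.txt" % fold)
    return [
        [convert, "-shuffle", "-backend", "leveldb", patch_dir, train_list, train_db],
        [convert, "-shuffle", "-backend", "leveldb", patch_dir, test_list, test_db],
        # The mean file is computed from the training database only.
        [mean, train_db, "DB_train_w32_%d.binaryproto" % fold, "-backend", "leveldb"],
    ]
```

On a machine with Caffe installed, each list could be passed to subprocess.check_call, looping over folds 1 to 5.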

Step 4: Training of DL classifier (Bash)

Setup files

Now that we have the databases, and the associated mean-files, we can use Caffe to train a model.

There are two files which need to be slightly altered, as discussed below:

BASE-alexnet_solver.prototxt: This file describes various learning parameters (iterations, learning method (Adagrad) etc).

On lines 1 and 10 change: “%(kfoldi)d” to be the number of the fold for training (1,2,3,4,5).

On line 2: change “%(numiter)d” to number_test_samples/128. This is to have Caffe iterate through the entire test database. It’s easy to figure out how many test samples there are using:

wc -l test_w32_1.txt
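The same arithmetic can be done in Python. This is a minimal sketch with hypothetical function names. It treats 128 as the test batch size and rounds up so that a final partial batch is still visited; the rounding is an assumption, since the text above only gives the division.

```python
import math


def count_samples(list_path):
    """Count the non-empty lines of a list file (what `wc -l` reports)."""
    with open(list_path) as handle:
        return sum(1 for line in handle if line.strip())


def num_test_iterations(num_test_samples, batch_size=128):
    """Batches needed to pass over every test sample at least once."""
    return int(math.ceil(num_test_samples / float(batch_size)))
```

For example, a test list with 12,800 entries needs 100 iterations, while one with 12,801 entries needs 101.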

BASE-alexnet_traing_32w_db.prototxt: This file defines the architecture.

We only need to change lines 8, 12, 24, and 28 to point to the correct fold (again, replace “%(kfoldi)d” with the desired integer).

Also, in this use case, since we have 3 possible classes (CLL, FL, and MCL), we need to change line 173 from “num_output: 2” to “num_output: 3”

That’s it!

Note, these files assume that the prototxts are stored in a directory called ./model and that the DB files and mean files are stored in the directory above (../). You can of course use absolute file path names when in doubt.

In our case, we had access to a high performance computing cluster, so we used a python script to submit all 5 folds to be trained at the same time. This script automatically does all of the above work, but you need to provide the working directory on line 11. I use this PBS script (BASE-qsub.pbs) to request resources from our Torque scheduler, which is easily adaptable to other HPC environments.
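The “%(kfoldi)d” and “%(numiter)d” placeholders happen to be in Python's dictionary-based string formatting syntax, so filling them in takes one line. The sketch below shows the mechanism only. The three template lines are a hypothetical stand-in, not the contents of BASE-alexnet_solver.prototxt, and fill_template is a hypothetical name.

```python
# Hypothetical stand-in for a solver template; the real BASE file differs.
SOLVER_TEMPLATE = """\
net: "%(kfoldi)d-alexnet_traing_32w_db.prototxt"
test_iter: %(numiter)d
snapshot_prefix: "%(kfoldi)d-alexnet"
"""


def fill_template(template_text, fold, num_iter):
    """Replace the %(kfoldi)d and %(numiter)d placeholders in a template."""
    return template_text % {"kfoldi": fold, "numiter": num_iter}


if __name__ == "__main__":
    # Print the filled-in template for fold 1 with 100 test iterations.
    print(fill_template(SOLVER_TEMPLATE, fold=1, num_iter=100))
```

One caveat of this approach: any literal percent sign in a template would have to be written as %% for the substitution to work.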

Initiate training

If you’ve used the HPC script above, things should already be queued for training. Otherwise, you can start the training simply by saying:

~/caffe/build/tools/caffe train --solver=1-alexnet_solver_ada.prototxt

Run this in the directory which has the prototxt files. That’s it! Now wait until it finishes (600,000 iterations). 🙂

Step 5: Generating Output on Test Images (Python)

At this point, you should have a model available to generate some output. Don’t worry if you don’t; you can use mine.

Here is a python script to generate the test output for the associated k-fold.

It takes 1 command line argument, the fold. In this case, since we solely need to make a judgement for the entire image, we can compute a limited number of pixels, for example having a stride of 32 similar to our patch extraction technique.
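To get a feel for how much work a stride saves, the sketch below enumerates the top-left corners of the patches that would be evaluated. It is a hypothetical helper, not taken from the output script, and the 32 x 32 patch size is an assumption based on the “w32” in the file names.

```python
def patch_origins(width, height, patch_size, stride):
    """Top-left (x, y) corners of all patches that fit fully inside the image."""
    xs = range(0, width - patch_size + 1, stride)
    ys = range(0, height - patch_size + 1, stride)
    return [(x, y) for y in ys for x in xs]


if __name__ == "__main__":
    strided = patch_origins(1388, 1040, patch_size=32, stride=32)
    print(len(strided))  # number of patches evaluated per image
```

Under these assumptions, a 1388 x 1040 image yields 1,376 patches at a stride of 32, compared with roughly 1.37 million at a stride of 1.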

The base directory is expected to contain:

BASE/images: a directory which contains the tif images for output generation

BASE/models: a directory which holds the 5 models (1 for each fold)

BASE/test_w32_parent_X.txt: the list of parent IDs to use in creating the output for fold X=1,2,3,4,5, created in step 2

BASE/DB_train_w32_X.binaryproto: the binary mean file for fold X=1,2,3,4,5, created in step 3

To compute the accuracy, and find out which images have been mis-classified, at a high level, this script: 

  1. Determines the actual class of the image, based off of the first letter of the file name: C,F,M
  2. Extracts patches and runs them through the classifier to obtain their predicted class (0,1,2 for CLL, FL and MCL respectively)
  3. Computes their overall frequency as “votes” per class
  4. Takes the argmax of the frequencies to determine the overall predicted class for the image
  5. Updates the confusion matrix
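The voting steps above can be sketched in plain Python as follows. The classifier itself is left out: patch_predictions stands for the list of per-patch class indices it would return, and all function names are hypothetical rather than taken from the output script.

```python
import os
from collections import Counter

CLASS_NAMES = ["CLL", "FL", "MCL"]
FIRST_LETTER_TO_CLASS = {"C": 0, "F": 1, "M": 2}


def true_class(filename):
    """Actual class of an image, from the first letter of its file name."""
    return FIRST_LETTER_TO_CLASS[os.path.basename(filename)[0]]


def vote(patch_predictions):
    """Return (predicted class, per-class vote counts) for one image."""
    votes = Counter(patch_predictions)
    counts = [votes.get(c, 0) for c in range(len(CLASS_NAMES))]
    # Argmax of the vote counts; ties go to the lowest class index.
    return counts.index(max(counts)), counts


def update_confusion(confusion, filename, patch_predictions):
    """Add one image to the confusion matrix (rows: actual, cols: predicted)."""
    predicted, _counts = vote(patch_predictions)
    confusion[true_class(filename)][predicted] += 1
    return predicted


def accuracy(confusion):
    """Fraction of images on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / float(total) if total else 0.0
```

Any image whose predicted class differs from true_class(filename) is one of the mis-classified images.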

Final Notes

Efficiency in Patch Generation

Writing a large number of small, individual files to a hard drive (even an SSD) is likely going to take a very long time. Thus for Step 1 & Step 2, I typically employ a ram disk to drastically speed up the process. Regardless, make sure Matlab does not have the output directory in its search path, otherwise it will likely crash (or come to a halt) while trying to update its internal list of available files.

As well, using a Matlab Pool (matlabpool open) opens numerous workers, which also greatly speeds up the operation and is recommended.


It is very important to use the model on images of the same magnification as the training magnification. This is to say, if your patches are extracted at 40x, then the test images need to be done at 40x as well.

Code is available here

Data is available here (1.4G)

24 thoughts on “Use Case 7: Lymphoma Sub-Type Classification”

    1. Not sure what to tell you, I just downloaded the data from an off-site server without a problem. Maybe try a different network?

  1. Can I have the “deploy_train32.prototxt” document? In step5 line22

  2. Hello,
    let me ask a quick question. Since the images look to have various color among classes, I’m wondering how you handled the color problem like color standardization.

    1. my experiments tend to support the notion that deep learning is agnostic to color/stain variation, and as such stain normalization is not needed

  3. is there any step for normalization ? For example, subtract the mean image of dataset or divide it on 225 ? or it is not necessary to perform any kind of normalization ? thanks

  4. Hi,
    if I want to use my own WSI to predict response to a drug(thus, the output is binary:response or non-response), is this pipeline here suitable for that?

    1. sure, i don’t see why not. you may find that you need to try a larger patch size and network, but overall this approach is essentially directly applicable

  5. Hi,
    did you rewrite the alexnet_solver.prototxt named as BASE-alexnet_solver.prototxt? Because when I google ‘/models/bvlc_alexnet/solver.prototxt’ and read it ,I could not find the lines 1 and 10 which is written with “%(kfoldi)d”

  6. Dear Janowczyk,
    Firstly, I am very pleasure to read your informative blog with interesting datasets and codes as well.
    However, I have two questions about lymphoma project.
    (1) Could I cite the original dataset to convince every one about the classification/diagnosis?
    (2) I don’t have any experience in using Caffe and have not install the package on my MacBook Pro yet! So I have not rerun your codes. I have trained the dataset using various models/architectures of Keras, and the best metrics scores of mine are reached 94.12% accuracy, and F1-score (CLL = 93%; FL = 95%; & MCL = 94%). So far I hope I can improve such scores with fine-tune of hyper parameters. Thus I’d like know how metrics scores of your lymphoma classification using Caffe are?
    Your sincerely.

      1. Dear Janowczyk,
        Many thank for your response. I have found your paper out. That’s all I need to refer.
        Best regards.

          1. Dear Janowczyk,
            I am apologize for your patience. I have one more question about the lymphomas dataset and your training models. First, as you told in the paper it seems to do at your beginning steps that you performed segmented images (by using MATLAB codes) then you trained your model with segmented one? Unfortunately, I am not familiar with MATLAB, so I use whole pictures to train my models and reach the metrics as I showed you before. So I wonder (of course if I understand you doing steps correctly) that I could achieve the higher metric scores.
            If you feel free, should you share me your segmented the lymphomas dataset?
            All the best

          2. Sorry, not sure i understand your question? The only data i have/used for this task is available on this site. I don’t have any additional lymphoma datasets

  7. Hi Andrew, my question is that did you segment images before using as input data for training models? Because I use directly the original dataset (without segmentation) to train my models.
    My purpose is to develop a real-time tool to detect classes of lymphomas. So that could we connect both images segmentation step (using MatLab, of course we can segment images using Python) and build pretrained data in ONE.
    Kindly. Linh

    1. sorry i’m not following, what segmentation? how does segmentation fit into the classification component of this use case?

      1. Hi Andrew,
        I am so sorry for your inconvenience. I have read your paper and blog throughout and re-run your codes as well. It is pity whilst my mac (included NVIDIA GTX 750M but can’t be used the GPU) thus time to re-train your codes takes 3 days over and have not been completed yet. Hopefully it’ll finish in next couple of days.
        Parallelly, I optimize my codes using Keras API with modified pre-trained weights of ResNet50, ResNet101, and ResNet152 models and result with accuracies 94.65%, 97.59% and 97.06%, respectively. The all training times take around 20 hours. I think it is not bad, if I go further with pair trains and write gaps of trains for 3 models to improve an overall accuracy.
        I am very appreciated and thank you.
        All the best
