Use Case 3: Tubule Segmentation

This blog posts explains how to train a deep learning tubule segmentation classifier in accordance with our paper “Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases”.

Please note that there has been an update to the overall tutorial pipeline, which is discussed in full here.

This text assumes that Caffe is already installed and running. For guidance on that you can reference this blog post which describes how to install it in an HPC environment (and can easily be adopted for local linux distributions).


As mentioned in the paper, the goal of this project was to create a single deep learning approach which worked well across many different digital pathology tasks. On the other hand, each tutorial is intended to be able to stand on its own, so there is a large amount of overlapping material between Use Cases.

If you’ve already read the Use Case 1: Nuclei Segmentation, in this tutorial you can focus on “Step 1: Patch Extraction”, which contains all the differences from the nuclei tutorial.


The morphology of tubules is correlated with the aggressiveness of the cancer, where later stage cancers present with the tubules becoming increasingly disorganized. The Nottingham breast [22] cancer grading criteria divides scoring of the tubules into three categories according the area relative to a high power field of view: (i) > 75% (ii) 10%-75%, and (iii) <10%. The benefits of being able to identify and segment the tubules is thus two fold, (a) automate the area estimation, decreasing inter/intra reader variances, and (b) provide greater specificity which can potentially lead to better stratifications associated with prognosis indication.

Tubules are the most complex structures considered so far. They not only consist of numerous components (e.g., nuclei, epithelium and lumen) but the organizational structure of these components determines tubule boundaries. There is a very large variance in the way tubules present given the underlying aggressiveness and stage of the cancer. In benign cases, tubules present in a well-organized fashion with similar size and morphological properties, making their segmentation easier, while in cancerous cases, it is clear that the organization structure breaks down and accurately identifying the boundary becomes challenging, even for experts. To further compound the complexity of the situation, tubules as an entity are much larger compared to their individual components, thus requiring a greater viewing area to provide sufficient context to make an accurate assessment.


We break down this approach into 5 steps:

Step 1: Patch Extraction (Matlab): extract patches from all images of both the positive and negative classes

Step 2: Cross-Validation Creation (Matlab): at the patient level, split the patches into a 5-fold training and testing sets

Step 3: Database Creation (Bash): using the patches and training lists created in the previous steps, create 5 sets of leveldb training and testing databases, with mean files, for high performance DL training.

Step 4: Training of DL classifier (Bash):  Slightly alter the 2 prototxt files used by Caffe, the solver and the architecture to point to the correct file locations. Use these to train the classifier.

Step 5: Generating Output on Test Images (Python): Use final model to generate the output

There are, of course, other ways of implementing a pipeline like this (e.g., use Matlab to directly create a leveldb, or skip the leveldb entirely, and use the images directly for training) . I’ve found using the above pipeline fits easiest into the tools that are available inside of Caffe and Matlab, and thus requires the less  maintenance and reduces complexity for less experienced users. If you have a suggested improvement, I’d love to hear it!

Dataset Description

The dataset consist of 85 images ColoRectal images scanned at 40x., where each image is 775 x 522 and contains a total of 795 regions.  The tubules were manually annotated across all images by an expert pathologist.

The format of the files is:

09-21631-03-2.bmp: original H&E image

09-21631-03-2_anno.bmp: mask of the same size, where non-zero pixels indicate pixels belonging to a tubule

Each image is prefaced by a code (e.g., “09-21631-03”) to the left of the third dash (-), which coincides with a unique patient number. In this dataset, each patients has more than 1 image associated with them (~10 patients  = 85 images).

The data and the associated masks are located here (90M).

Examples of these images and their annotates can be seen below

tuble-10-13799-02-4 tuble-10-13799-02-4-sub

tuble-09-1339-05-2 tuble-09-1339-05-2-sub

The benign tubules, outlined in red, (top row) are more organized and similar. On the other hand, when considering malignant tubules (bottom row), the variances are quite large making it more difficult for a learn from data approach to generalize to all unseen cases.

Step 1: Patch Extraction (Matlab)

We refer to step1_make_patches.m, which is fully commented.

A high level understanding is provided here:

  1. Break the images into groups by patient ID number, and then sub-divide into benign versus malignant so we can sample them differently.
  2. For each image, load the image and its annotation and downsize both of them by 25% (10x apparent magnification). Generate the texture feature space, quickly train a naive Bayesian classifier and use its posterior probabilities to identify “hard” samples in both the positive and negative classes.
Original Image
Ground Truth
Hard Positives Mask
Hard Negatives Mask
  1. From each of these, randomly crop patches, save them to disk. At the same time, we maintain a “patient_struct”, which contains all of the file names which have been written to disk.

The file name format for each patch is as follows:

u_v_w_x_y_z.png   — > example 09-1339-01-1_anno_0_n_1_0.png

Where u is the patient ID, v is the image number of the patient, w is the class (0 (non-tubule) or 1 (tubule)), x is the type of patch (“n” is negative and “p” is positive, redundant with w in this use case), y is the number of the patch (1 to 2000), and z is rotation (0, 90, 180, 270 in this case).

Step 2: Cross-Validation Creation (Matlab)

Now that we have all of the patches written to disk, and we have all of the file names saved into patient_struct, we want to split them into a cross fold validation scheme. We use step2_make_training_lists.m for this, which is fully commented.

In this code, we use a 5-fold validation, which is split at the patient level. Again, remember splitting at the image level is unacceptable if multiple images can come from the same patient!

For each fold, we create 4 text files. Using fold 1 as an example:

train_w32_parent_1.txt: This contains a list of the patient IDs which have been used as part of this fold’s training set. This is similar to test_w32_parent_1.txt, which contains the patients used for the test set. An example of the file content is:


train_w32_1.txt: contains the filenames of the patches which should go into the training set (and test set when using test_w32_1.txt). The file format is [filename] [tab] [class]. Where class is either 0 (non-tubule) or 1 (tubule). An example of the file content is:

09-1339-01-1_anno_0_n_1_0.png 0
09-1339-01-1_anno_0_n_1_90.png 0
09-1339-01-1_anno_0_n_2_0.png 0
09-1339-01-1_anno_0_n_2_90.png 0
09-1339-01-1_anno_0_n_3_0.png 0

All done with the Matlab component!

Step 3: Database Creation (Bash)

Now that we have both the patches saved to disk, and training and testing lists split into a 5-fold validation cohort, we need to get the data ready for consumption by Caffe. It is possible, at this point, to use an Image layer in Caffe and skip this step, but it comes with 2 caveats, (a) you need to make your own mean-file and ensure it is in the correct format and (b) an image layer can is not designed for high throughput. Also, having 100k+ files in a single directory can bring the system to its knees in many cases (for example, “ls”, “rm”, etc), so it’s a bit more handy to compress them all in to 10 databases (1 training and 1 testing for 5 folds), and use Caffe’s tool to compute the mean-file.

For this purpose, we use this bash file:

We run it in the “subs” directory (“./” in these commands), which contains all of the patches. As well, we assume the training lists are in “../”, the directory above it.

Here we’ll briefly discuss the general idea of the commands, while the script has additional functionality (computes everything in parallel for example).

Creating Databases

We use the caffe supplied convert_imageset tool to create the databases using this command:

~/caffe/build/tools/convert_imageset -shuffle -backend leveldb   ./ DB_train_1

We first tell it that we want to shuffle the lists, this is very important. Our lists are in patient and class order, making them unsuitable for stochastic gradient descent. Since the database stores files, as supplied, sequentially, we need to permute the lists. Either we can do it manually (e.g., use sort –random) , or we can just let Caffe do it 🙂

We specify that we want to use a leveldb backend instead of a lmdb backend. My experiments have shown that leveldb can actually compress data much better without the consequence of a large amount of computational overhead, so we choose to use it.

Then we supply the directory with the patches, supply the training list, and tell it where to save the database. We do this similarly for the test set.

Creating mean file

To zero the data, we compute mean file, which is the mean value of a pixel as seen through all the patches of the training set. During  training/testing time, this mean value is subtracted from the pixel to roughly “zero” the data, improving the efficiency of the DL algorithm.

Since we used a levelDB database to hold our patches, this is a straight forward process:

~/caffe/build/tools/compute_image_mean DB_train_1 DB_train_w32_1.binaryproto -backend leveldb

Supply it the name of the database to use, the mean filename to use as output and specify that we used a leveldb backend. That’s it!

Step 4: Training of DL classifier (Bash)

Setup files

Now that we have the databases, and the associated mean-files, we can use Caffe to train a model.

There are two files which need to be slightly altered, as discussed below:

BASE-alexnet_solver.prototxt: This file describes various learning parameters (iterations, learning method (Adagrad) etc).

On lines 1 and 10 change: “%(kfoldi)d” to be the number of the fold for training (1,2,3,4,5).

On line 2: change “%(numiter)d” to number_test_samples/128. This is to have Caffe iterate through the entire test database. Its easy to figure out how many test samples there are using:

wc –l test_w32_1.txt

BASE-alexnet_traing_32w_db.prototxt: This file defines the architecture.

We only need to change lines 8, 12, 24, and 28 to point to the correct fold (again, replace “%(kfoldi)d” with the desired integer). That’s it!

Note, these files assume that the prototxts are stored in a directory called ./model and that the DB files and mean files are stored in the directory above (../). You can of course use absolute file path names when in doubt.

In our case, we had access to a high performance computing cluster, so we used a python script ( to submit all 5 folds to be trained at the same time. This script automatically does all of the above work, but you need to provide the working directory on line 11. I use this (BASE-qsub.pbs)  PBS script to request resources from our Torque scheduler, which is easily adaptable to other HPC environments.

Initiate training

If you’ve used the HPC script above, things should already be queued for training. Otherwise, you can start the training simply by saying:

~/caffe/build/tools/caffe train –solver=1-alexnet_solver_ada.prototxt

In the directory which has the prototxt files. That’s it! Now wait until it finishes (600,000) iterations. 🙂

Step 5: Generating Output on Test Images (Python)

At this point, you should have a model available, to generate some output images. Don’t worry, if you don’t, you can use mine.

Here is a python script, to generate the test output for the associated k-fold (

It takes 2 command line arguments, base directory and the fold. Make sure to edit line 88 to apply the appropriate scaling; in this case reduce 20x images to 10x images by resizing by 1/2.

The base directory is expected to contain:

BASE/images: a directory which contains the tif images for output generation

BASE/models: a directory which holds the 5 models (1 for each fold)

BASE/test_w32_parent_X.txt: the list of parent IDs to use in creating the output for fold X=1,2,3,4,5, created in step 2

BASE/DB_train_w32_X.binaryproto: the binary mean file for fold X=1,2,3,4,5, created in step 3

It generates 2 output images for each input. A “_class” image and a “_prob” image. The “_prob” image is a 3 channel image which contains the likelihood that a particular pixel belongs to the class. In this case, the Red channel represents the likliehood that a pixel belongs to the non-tubule class, and the green channel represents the likelihood that a pixel belongs to the tubule class. The two channels sum to 1. The “_class” image is a binary image using the argmax of the “_probs image”.

Notice that the output masks are 193 x 130 or 1/4 the size of the input masks. This is very important, we trained the classifier at 10x, so we need to test it at 10x. Since the images are all taken at 40x, we down-sample them to obtain the correct apparent magnification.

tuble-10-13799-02-4-sub tuble-10-13799-02-4_prob-sub
tuble-10-13799-02-4 tuble-10-13799-02-4_prob
tuble-09-1339-05-2-sub tuble-09-1339-05-2_prob-sub
tuble-09-1339-05-2 tuble-09-1339-05-2_prob

Typically, you’ll want to use a validation set to determine an optimal threshold as it is often not .5 (which is equivalent to argmax). Subsequently, use this threshold on the the “_prob” image to generate a binary image.

Final Notes

Efficiency in Patch Generation

Writing a large number of small, individual files to a harddrive (even SSD) is likely going to take a very long time. Thus for Step 1 & Step 2, I typically employ a ram disk to drastically speed up the processes.  Regardless, make sure Matlab does not have the output directory in its search path, otherwise it will likely crash (or come to a halt), while trying to update its internal list of available files.

As well, using a Matlab Pool (matlabpool open), opens numerous workers which also greatly speed up the operation and is recommended as well.

Efficiency in Output Generation

It most likely will take a long time to apply a the classifier pixel wise to an image to generate the output. In actuality, there are many ways to speed up this process. The easiest way in this case is to simply use a larger stride such that you compute every 2nd or 3rd pixel since tubule segmentation doesn’t require nearly as much precision, as say, nuclei segmentation.

Keep an eye out for future posts where we delve deeper into this and provide the code which we use!


It is very important to use the model on images of the same magnification as the training magnification. This is to say, if your patches are extracted at 10x, then the test images need to be done at 10x as well.

Code is available here

Data is available here (90M)

42 thoughts on “Use Case 3: Tubule Segmentation”

  1. Dear Dr. Andrew,

    I am the grade 2 Phd candidate from Beijing university of posts and telecommunications, China. I am keen on the pathological image analysis and deep learning research. your tutorials are well done jobs. benign.txt and malignant.txt are missing, could you share them with me? Thank you! My email is

    I’m looking forward to your earliest reply.

    Sincerely yours,
    Kevin Pan

    1. Thank you for your comment and pointing out my oversight. I have updated the tgz file to now contain the 2 text files.

  2. Hello!
    This is really great, just wondering if the code is compatible to run on windows? I’ve finally setup Caffe on windows and realise this might be for a different platform..

    1. I’ve never used it on windows, but off the top of my head, I don’t see why it wouldn’t work. Give it a try and let me know!

      1. Thanks for the quick reply! this sounds really excited!

        I’m actually stuck at step 3 when I run $ sh C:/projects/dl-tutorial-code/common/ in the tubule/subs folder, using ”mingw64 git bash” and it says C:/projects/caffe/build/tools/computer_image/mean: no such file or directory.

        Sorry I’m new to executing sh and py script, I have a feeling that I’m making a basic mistake, is there a tutorial I can look at?

  3. Hi Andrew, I think it can definitely able to run on windows but I’ve been having some issue to get to the end. Then I realised there was some minor errors from the beginning.
    I’ve got exceptions for ”colour_deconvolution”, have you got a matlab file for this function? it’s missing from the github package.
    Best wishes

  4. Hi Andrew, I may have spotted another minor error:
    in step 1, ilf=cat(3,io,ilf);
    I’ve got another exception undefined ‘ilf’ so I guess this should be:
    ilfo=cat(3,io,ilfo) ?

    1. ah yes! good eye! I think in the original version I actually used a secondary script to make all of the texture mat files and named them slightly differently

      1. some quick updates, so far I’ve tested step 1,2,3,5 on windows 10, python 3.5.2, cuda 8 and produced the expected output, there are two some more minor things may will be useful for future users:
        in step 2 the txt was created with a format (‘%s\t%d\n’)
        for some reason step 3 wouldn’t recognise the tabs so I replaced with a space and it worked!
        Step 5 python 3 doesn’t recognise xrange anymore and I changed it to ‘range’ is used instead.
        I’m still stuck at step 4 and keep getting the error message ‘Check failure stack trace’ without any error details. I suspect there is something wrong with my training parameters will let you know how it goes 😉

          1. this is awesome! Thanks !I was worried about the speed of this version and you already had a solution 😀

        1. Hi Ding,
          I’m trying to build pycaffe on python 2.7 on Windows 10+Visual Studio 2015. I’ve put all the components in the ‘caffe-windows/python’ folder into my ‘python27/Lib/site-packages.’ When I tried importing caffe from python, I got the following error:

          Traceback (most recent call last):
          File “”, line 1, in
          import caffe
          File “C:\Python27\lib\site-packages\caffe\”, line 1, in
          from .pycaffe import Net, SGDSolver, NesterovSolver, AdaGradSolver, RMSPropSolver, AdaDeltaSolver, AdamSolver, NCCL, Timer
          File “C:\Python27\lib\site-packages\caffe\”, line 13, in
          from ._caffe import Net, SGDSolver, NesterovSolver, AdaGradSolver, \
          ImportError: DLL load failed: %1 is not a valid Win32 application.

          Steps 1-4 work perfectly. But pycaffe won’t seem to work. How did you get pycaffe working on Windows 10?

  5. Hello Andrew,
    It’s been a few exciting days and I’ve successfully applied your method to my images, and the segmentation results look awesome!
    I also tried to integrate the case 3 example with the new pipeline, the highest training accuracy was a amazing 92%. however the final output seems a bit odd. one reason I found was because the original new pipeline work on images with intensity of 0-255; for the tubule examples, the images become double 0-1 and the output from ‘net_full_conv.forward_all’ become all ones.
    after playing with the python script (sry I’m still new to python) I found out if I change transformer.set_raw_scale(‘data’, 255.0) to transformer.set_raw_scale(‘data’, 1.0), the output images start showing some grayscale but still not completed correct.
    Do you know is there anywhere else I need to change to get the correct output? 🙂
    thanks a lot!

    1. Ultimately, it doesn’t matter if its [0,255] or [0,1], as long as the associated mean file is aligned to the same range. Also, I think the caffe wrapper also does some conversion when it sees that you’re using uncompressed image data; keep in mind the LMDB also supports compressed images, which are really just a stored byte stream which is decompressed in real-time. That said, i’ve never paid particular attention to the range and use the same basic scripts for everything. Since you have a working example now, you can align the new dataset format to the old dataset style and should be ok.

      1. Hi Andrew, thanks a lot for it, It’s working now!
        Another thing, I was browsing your github and noticing there is no license.txt or copyright.txt etc. maybe you should put one? or please let me know if you have one 🙂

  6. Dear Dr. Andrew,

    I am following your tutorials and I have a question: To evaluate the F1-measure (F1=0.83 in your paper), we should an cutoff about the ‘distance from the true label’. How far ‘a predicted positive pixel’ from ‘a labeled positive pixel’ is judged as a true positive?

    Sincerely yours,

    1. 0 pixels, it either or isn’t 😉 for example, in matlab, TP is defined as TP=nnz(dlob&io);, where dlob is the binary mask for the deep learning and io is the ground truth mask.

      1. OK, thank you very much. I followed the whole process and got F1=0.62, while your result is 0.83. Could you please provide some hints result in this big gap? How should I refine the results?

        1. the most common error i usually find is that people are using the models to produce output at the wrong magnification or have incorrectly selecting the appropriate probability to threshold the positive class. I’d recommend checking those items first

  7. Dear Dr. Andrew,
    I split my svs file into 775 x 522 images at 40X, but there are few tubules in the images.(tubules also larger than your data)
    Have I set wrong resolution?
    Thank you!

    1. Maybe. try downloading my tubule dataset, as well as looking at the associate tutorial (i might have resized it in code on the fly?) and measuring the size of the tubules in pixels. then you can know how much to scale your own images

      1. I measured 36 tubules from my dataset at 20X, and 36 tubules from your tubule dataset. My dataset(mean: 10917.19 pixel , sd: 4640.439), your dataset(mean: 12486.58, sd: 3846.503). Then I perform t test, and the result shows there no significant difference in means of two dataset(p-value = 0.1229). So I think your tubule dataset maybe scanned at 20X, not 40X.

        1. that sounds reasonable. don’t forget to look at the code itself, “io=imresize(io,.25,’near’); %resize the images”, so whatever the input magnification, the patches are being abstracted from a version 1/4th the size. ultimately you just need to make sure that there is enough “context” present in the extracted patch to determine if there is a tubule present or not. too high of a magnification and the only thing that will be present is a single nuclei (not enough context to govern if its part of a tubule or not) and too low of a magnification and the resolution will be so poor it’ll be impossible to see the tubule

  8. Dear Dr. Andrew
    In the first step, we need to import ‘benign.txt’ and ‘malignant.txt’, but i did not find them in the data file downloaded from the website. where can I find them?
    very thanks

  9. Hi Andrew,
    I am following your tutorials, so far i finished successfully the 3rd Step, but I am stuck with Step 4 with a technical issue:
    I am running it without an HPC, therefor tried to run the
    “1-alexnet_solver_ada.prototxt” command in Bash (in the model directory), but got the error “command not found” for each line of the code. what am i doing wrong?

    1. the prototxt is *not* a script, but a description of the network. if you want to run individual folds, you should use the caffe-train command from the “initiate training” section

    1. Hi, I’m not sure how to help you, did you specifically look for pixels which aren’t the value of “0”? the annotations are numbered numerically starting with “1”

        1. compare the difference between the output in matlab for these 2 output commands:

  10. Hi Andrew,

    I am searching for papers that using deep learning and apply segmentation to your tubule and epithelium datasets since I also use your both datasets and need to compare it with recent studies. But I cannot find any. Do you have any papers uses your datasets?

    Thank you,

    1. Sorry, unforunately it not possible for me to keep track of who/when these datasets get used. Since anyone using the data should be citing the manuscript, you can try looking through that list.

Leave a Reply

Your email address will not be published. Required fields are marked *