Download TCGA Digital Pathology Images (FFPE)

Digital pathology image analysis requires high-quality input images. While there are a large number of images available in The Cancer Genome Atlas (TCGA), those currently available in the data portal are frozen specimens and are *not* suitable for computational analysis. This post discusses how to download the Formalin-Fixed Paraffin-Embedded (FFPE) slides for the corresponding patients.

First, a brief introduction: the TCGA offers two types of slides, flash frozen and Formalin-Fixed Paraffin-Embedded (FFPE). Flash frozen samples are typically produced during surgery in a cryolab to help the surgeon determine whether the borders of the tumor are clean (i.e., whether the tumor has been fully resected). Flash freezing is a fast and “easy” process, but it frequently leaves the tissue damaged, giving it a Swiss-cheese-like appearance:

[Image: frozen slide example]

FFPE slides are the gold standard for diagnostic medicine, and are generated by fixing a specimen in formaldehyde and then embedding it in a paraffin wax block for cutting. The resulting tissue has a much nicer appearance, making it more amenable to computational analysis:

[Image: FFPE slide example]

A fuller discussion is available here and here.

The TCGA has both types of slides available, so care must be taken to obtain the correct cohort and *not* mix cohorts unless specifically part of your experimental design.

The two types can be told apart by the filename: files containing “TS#” or “BS#”, where # is an integer, are frozen slides, like this:

TCGA-CH-5765-11A-01-TS1.2a1faf76-526b-4581-b947-e8d733674df7.svs

Files containing “DX#”, again where # is an integer, are FFPE slides:

TCGA-14-0786-01Z-00-DX2.9dd57cfe-f467-4796-a491-48b737a6248c.svs
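
As a rough illustration of this naming convention, the following Python sketch classifies a slide by its filename. The function name `slide_type` and the regular expression are illustrative assumptions, not part of any TCGA tooling, and the pattern covers only the “TS#”, “BS#”, and “DX#” codes described above; anything else is reported as unknown.

```python
import re

# Assumption: the slide code sits between a hyphen and the first dot,
# e.g. "-TS1." or "-DX2.", as in the two example filenames above.
SLIDE_CODE = re.compile(r"-(TS|BS|DX)(\d+)\.")


def slide_type(filename):
    """Return 'frozen', 'ffpe', or 'unknown' based on the filename code."""
    match = SLIDE_CODE.search(filename)
    if match is None:
        return "unknown"
    # DX marks a diagnostic (FFPE) slide; TS and BS mark frozen slides.
    return "ffpe" if match.group(1) == "DX" else "frozen"


print(slide_type("TCGA-CH-5765-11A-01-TS1.2a1faf76-526b-4581-b947-e8d733674df7.svs"))
print(slide_type("TCGA-14-0786-01Z-00-DX2.9dd57cfe-f467-4796-a491-48b737a6248c.svs"))
```

A check like this only reads the label in the filename, so it is best treated as a first-pass filter rather than a replacement for looking at the slides.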

To perform the download, we need two components: (1) the TCGA download tool (the gdc-client), and (2) a manifest file which lists, by precise ID, which files to download.

First we need to go to the TCGA data portal, located here: https://portal.gdc.cancer.gov

Then we click on “Repository”:


Then click on “slide image” under “Data type”


Then “Diagnostic Slide” under “Experimental Strategy”


This produces a list of slides, all of which have the “DX#” string in their filename:


We can limit the list to a specific organ group by clicking, e.g., Cases, and then breast:
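
The same selections can, in principle, be captured as a repository URL instead of a series of clicks. The sketch below builds such a URL in Python, following the `filters=` query format visible in the clinical-data link in the comments further down. The field names (`files.data_type`, `files.experimental_strategy`, `cases.primary_site`) and their values are assumptions based on the labels shown in the portal, and the current version of the portal may not accept this URL format, so treat the output as something to verify by hand.

```python
import json
from urllib.parse import quote, unquote


def in_filter(field, values):
    """Build one 'field is in values' clause in the portal's filter format."""
    return {"op": "in", "content": {"field": field, "value": values}}


# Assumed field names and values; check them against the portal before relying on them.
filters = {
    "op": "and",
    "content": [
        in_filter("files.data_type", ["Slide Image"]),
        in_filter("files.experimental_strategy", ["Diagnostic Slide"]),
        in_filter("cases.primary_site", ["breast"]),
    ],
}

# Compact JSON, fully percent-encoded, appended to the repository URL.
encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
url = "https://portal.gdc.cancer.gov/repository?filters=" + encoded
print(url)
```

Keeping the filter as a small data structure makes it easy to swap “breast” for another organ group without repeating the clicks.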


Now we have the 1,133 files that we would like to download. We do this by clicking “add all files to cart” (or selecting the ones we are interested in):


Lastly, we go to the cart and select Download -> Manifest:


 

This provides us with a txt file that we can feed to the gdc-client:

gdc-client download -m gdc_manifest_20180801_125430.txt
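
Before starting a large download, it may be worth confirming that the manifest really contains only diagnostic slides. The sketch below assumes the manifest is a tab-separated file with a header row that includes a `filename` column; the other column names and all values in the sample are placeholders for illustration. `keep_diagnostic` is a hypothetical helper, not part of the gdc-client.

```python
import csv
import io

# Stand-in manifest for illustration only: ids, md5, size and state are placeholders.
SAMPLE = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "example-id-1\tTCGA-CH-5765-11A-01-TS1.2a1faf76-526b-4581-b947-e8d733674df7.svs\tna\t0\tna\n"
    "example-id-2\tTCGA-14-0786-01Z-00-DX2.9dd57cfe-f467-4796-a491-48b737a6248c.svs\tna\t0\tna\n"
)


def keep_diagnostic(manifest_text):
    """Return the manifest text with only rows whose filename has a DX code."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if "-DX" in row["filename"]:
            writer.writerow(row)
    return out.getvalue()


filtered = keep_diagnostic(SAMPLE)
print(filtered)
```

For a real manifest, the text would be read from the downloaded txt file and the filtered result written to a new file for the gdc-client.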

That's it!

63 thoughts on “Download TCGA Digital Pathology Images (FFPE)”

        1. Sorry i’m not familiar with the digital slide archive. it is not affiliated with the TCGA. If you would like access to their site, I would suggest emailing them. if you would like clinical data from the TCGA, you may look here: https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22clinical%22%5D%7D%7D%5D%7D

          1. Thanks for the reply 🙂

            What I really want is access to the reports for processing with natural language. Would you have a dataset that you know that provides something like this?

          2. hmm sorry nothing immediately comes to mind, it's a bit outside of my expertise. these are hard to get without a direct collaboration with a hospital given their highly sensitive nature. not sure how much data in that context will be made freely publicly available

    1. either need to make yourself or find a published paper which has used them and ask them for whatever annotations you’re interested in

  1. Is there any formal document from GDC stating that files with “TS#” or “BS#” are frozen slides, and files with “DX#” are FFPE slides? I find some files with “TSA” or “TSB”, and don’t know what they mean, so I am really confused.

    1. i dont know of any, if you find one please let me know : ) the TS and BS stand for “top slide” and “bottom slide” and are used during surgery to ensure that resection has clean boundaries. since the patient is still on the operating table, these are always flash frozen. “diagnostic” slides by definition are FFPE. this can be seen when looking at the data portal under “experimental” strategy, there are two options “tissue slide” (frozen) and “diagnostic slide” (ffpe). not sure if there will be a formal document explicitly saying this since its fairly routine practice to my knowledge

  2. Hi Andrew, many thanks for this. I am interested in playing around with DL methods and have couple of questions. How do you convert svs files to tiff (or any other format)? Which image format do you prefer to work with? How do you tile the images (and how many tiles do you create)? Thanks

    1. thank you for your questions. depending on the use case, no conversion may be necessary as it's possible to load particular regions of the image directly using either openslide or matlab (examples of that are on this blog and https://github.com/choosehappy/HistoQC). ultimately, if the experiment is going to be repeated often only on particular regions of interest, i do prefer extracting those regions of interest as high quality png/tif files so that they’re easier to access. unfortunately, how to tile and how many to tile are very dependent on the use case and the amount of data available so there is no real hard and fast rule. in general, enough so that the DL can learn, but not so many that it takes ages to train for little added value : )

  3. Beside, ‘BS’ and ‘TS’, some slides are ‘MS’. I suppose those are ‘Middle Section’ frozen samples. Is that correct?

    1. Yes. TCGA wanted to make sure the slide sections were consistent, and initially made top/bottom/middle sections. Later they found there was not much need to keep 3 sections, and scaled down to only 2

  4. Hi Andrew,

    I was wondering if you had already tried to correlate pathology and radiology findings in the image data from TCGA? Whereas for pathology the images are well annotated and the ID can be used to link the findings of the diagnostic slides to the associated clinical data, on the radiology side we are having some trouble. Very often there are multiple visits for a patient (hence multiple series), and the dates of each visit are randomized, so we do not know which is the diagnostic visit.

    If you have tried to link the pathology to the radiology and can give us some hints that would be greatly appreciated.

    1. Thanks for your question. I think its an interesting avenue to pursue, but unfortunately have not done so myself and thus don’t have any advice to give you :-\

      1. Yeah indeed it is very interesting but also challenging for a lot of reasons. We will keep digging on the online data from TCGA …

        Cheers

  5. Hi,

    I’d like to download Immunohistochemistry (IHC) images from TCGA. It doesn’t seem to be trivial to search the database. I’ve been trying for days now, but no luck. Any help would be much appreciated!

    Thanks,
    Ali

  6. Hi Andrew,

    Are all of the FFPE slides in TCGA stained with H&E? I can’t seem to find this info anywhere, and I suspect the answer is yes, but I just want a confirmation. I’ve become paranoid that some of them are H-DAB, since QuPath, the software I’m using to look at the whole slide images, detects several of them to be H-DAB (but to me they look like H&E).

    Is the staining info available somewhere in the TCGA metadata?

    Thanks,
    Adam

    1. I would be very careful and not make that assumption without visual verification. i know for a fact some of the ffpe samples are *actually* frozen samples, which are mislabeled. This tool we built helps to find these slides and those with artifacts. To my knowledge there is no staining info, keep in mind the diagnostic slides weren’t originally intended to be used for computational analysis, but were instead intended to be interpreted by humans for estimations of e.g., tumor purity, to place the -omics modalities data into better context

  7. Hello,

    I would like to train a model using Frozen images from TCGA (tumor and non-tumor, as done in https://www.nature.com/articles/s41598-019-46718-3) but to predict on FFPE images. What is your opinion on doing that?
    I’ve seen some articles doing similar things:
    (1) https://openaccess.city.ac.uk/id/eprint/21373/1/J2019_Kather_PredictingSurvivalColorectal_PLOSMedicin.pdf => they train with frozen images and validate on an external cohort with FFPE images
    (2)
    https://www.nature.com/articles/s41591-019-0462-y => trained in FFPE and tested in FFPE has an AUC of 0.84; when trained in frozen and tested in FFPE AUC drops to 0.61 (27% drop)

    Would you have any other references?

    Thank you a lot for your help,

    1. I would avoid at all costs unless the particular experimental design calls for it. Ultimately flash-frozen tissue has a significantly different presentation due to tissue damage imparted during the freezing process. I would thus always make attempts to align modalities so that the learning + classifier will be robust to realistic variabilities, and not struggle to compensate between different modalities.

    1. Sorry, not entirely sure what you’re asking. what command are you using? perhaps the service is unavailable at the moment?

      1. I use this command
        ”gdc-client download -m gdc_manifest_20210506_014815.txt”
        to download diagnostic slide files, but encountered the error ”requests.exceptions.HTTPError: 451 Client Error: UNAVAILABLE FOR LEGAL REASONS for url:”.

        1. Sorry, no idea. Keep in mind some of the files may not be available unless you have an authorized GDC account

          1. I tried individually using different UUIDs but the error is the same. Looks like they found a legal issue with HIPAA after making it public.

          2. That is very well possible, you can try contacting the TCGA folks directly, perhaps they can provide more details. Would be interesting to know what hiccups they faced along the way

        2. Recently many of these files were discovered to contain PHI in the label portion of the slide including dates and patient names. For this reason they were taken offline. I imagine they will come back in the future once the labels have been scrubbed (not sure why they were included in the first place – only bad things can happen).

  8. Hi,
    I am not able to download TCGA, I tried all possible methods. Ping works, other metadata download works, but WSI is having some problem. Here is the message.

    ERROR: 0001a1fb-f388-41c6-bfe9-ecbb10429e37: 451 Client Error: UNAVAILABLE FOR LEGAL REASONS for url: https://api.gdc.cancer.gov/data/0001a1fb-f388-41c6-bfe9-ecbb10429e37: {“message”:”Request contains a redacted file(s): [‘0001a1fb-f388-41c6-bfe9-ecbb10429e37’], action not allowed”}

    1. Take a look at the last line of the blog post : ) Just provide the filename directly

      gdc-client download -m gdc_manifest_20180801_125430.txt

  9. Hello,

    are any annotations available for these datasets? If so, how can they be used to draw the annotations in the original svs files?

    Thank you,
    Olga

    1. Some folks may have made them available as part of their manuscripts, but i unfortunately don’t know of any single repository :-\

      1. Hello

        Do you know how to screen this database for cases of single gene mutations? Like the ALK mutation in lung cancer. Would you like to share it with me?
        Thank you very much!
        yangyu

  10. Hi there. Thanks for your nice tips to find those images. Now after downloading that manifest how I should get the actual images? I saw something like GDC Data Transfer Tool but I am not sure how to use it.

  11. First, thanks for this great post.
    I am going to open the “manifest” file eventually with qupath. Could you please introduce an instruction to me to do so? I am new to qupath.

    Thanks,
    Mas

    1. the manifest file and qupath are unrelated. you’ll need to use the gdc-client to download the WSI, and then you can open them in qupath

  12. Great instructions. Are the TCGA slides citeable in the sense of incorporating slide content (e.g. a vascular invasion) into an own manuscript? Really exempt from copyright? Many thanks.
