Transferring data FASTER to the GPU With Compression

Utilization of current GPUs is often limited by how quickly data can be moved onto and off the device. More precisely, taking data from host RAM and transferring it over the PCI-e bus to GPU RAM is the bottleneck in many deep learning use cases.

While newer computers/architectures are aiming to reduce this bandwidth limitation via reorganizing and optimizing CPU/GPU communication (e.g., GPUDirect Storage), consumer-grade computers with common GPUs (e.g., RTX 2080) won’t receive these benefits.

A fair question, then, is whether there is anything we can do on the software side to improve overall throughput. Nvidia has invested in this area by providing a number of libraries; here I will show an example using nvJPEG.

Simply stated, we can improve overall data transmission throughput by sending less, but denser, data over that bridge. This shouldn’t be too surprising, as the approach has been used for years and is the whole point of compression algorithms: same data, less space.

In this case, the original binary JPEG file is loaded into host RAM as-is, still compressed. Next, via the function decode_jpeg, this compressed data stream is sent to the GPU, where nvJPEG decompresses it on the device itself.
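A minimal sketch of that two-step flow might look like the following. It assumes torchvision's `read_file` does the host-side load (the loader is not named above, so this is a guess) and that a CUDA device is available. `load_jpeg_on_gpu` is a hypothetical helper name, not something from the original notebook.

```python
def load_jpeg_on_gpu(path, device="cuda"):
    """Read a JPEG's compressed bytes on the host, then decode them on `device`."""
    # Imported lazily so the helper can be defined on machines without a GPU stack.
    from torchvision.io import decode_jpeg, read_file

    # Step 1: raw file bytes into host RAM as a 1-D uint8 tensor (no decoding yet).
    data = read_file(str(path))
    # Step 2: pass the still-compressed bytes to decode_jpeg. With a CUDA device,
    # decoding happens on the GPU via nvJPEG; the result is a uint8 (C, H, W) tensor.
    return decode_jpeg(data, device=device)


# Hypothetical usage:
# img = load_jpeg_on_gpu("10279_500_f00182_original.jpg")
# print(img.shape, img.device)
```

The point of the design is that only the compressed bytes cross the PCI-e bus; the full-size pixel tensor first exists in GPU memory.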

This hopefully makes some intuitive sense. For example, a 2000 x 2000 RGB image saved in raw format requires 12MB, and transferring it as a matrix to a GPU requires the same amount of data to be moved across the bridge. On the other hand, if we can accept JPG compression at an 80% quality value, this image becomes a 1MB file, or about an order of magnitude smaller!
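The arithmetic behind those numbers is short enough to spell out. This sketch assumes 8 bits per channel and treats the JPEG as exactly 1MB for illustration; `raw_image_bytes` is a hypothetical helper.

```python
def raw_image_bytes(width, height, channels=3, bytes_per_channel=1):
    """Size of an uncompressed image held as a dense array."""
    return width * height * channels * bytes_per_channel


raw = raw_image_bytes(2000, 2000)  # 2000 * 2000 * 3 bytes
jpeg = 1_000_000                   # assumed: the ~1MB JPEG at quality 80
ratio = raw / jpeg

print(f"raw: {raw / 1e6:.0f} MB, JPEG: {jpeg / 1e6:.0f} MB, ratio: {ratio:.0f}x")
```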

Here is a brief example of that. Baseline:

55.1 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With nvJPEG decoding on the GPU (the call ends in device=device):

28.6 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here we can see nearly a 2x speed improvement. What if we further compress this image from 80% quality down to 70% quality:

# --- make a more highly compressed version
cv2.imwrite('10279_500_f00182_original_highly.jpg', data, [int(cv2.IMWRITE_JPEG_QUALITY), 70])

!ls -lh *.jpg

1017K 10279_500_f00182_original.jpg 
683K  10279_500_f00182_original_highly.jpg 


Baseline:

42.9 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With nvJPEG decoding on the GPU (the call ends in device=device):

16.5 ms ± 82.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now with a ~30% smaller file, we see a speed improvement of 2.6x!
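The speed-ups quoted so far can be checked from the timings and file sizes printed above. In this sketch the slower timing in each pair is assumed to be the baseline and the faster one the GPU-decode path.

```python
# (baseline_ms, gpu_decode_ms) pairs copied from the timings in this post.
timings_ms = {
    "quality 80": (55.1, 28.6),
    "quality 70": (42.9, 16.5),
}

speedups = {name: base / gpu for name, (base, gpu) in timings_ms.items()}

# File sizes from the ls output: 1017K at quality 80, 683K at quality 70.
size_reduction = 1 - 683 / 1017

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x faster")
print(f"file is {size_reduction:.0%} smaller at quality 70")
```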

Interestingly, the major assumption here is that your data is compressible. If we randomly generate an image, it will not compress as well as a “natural” image; in this case, at a quality of 80%, the file shrinks from 112MB to 64MB, or about half. As a result, the speed-up is less pronounced than above, but still impressive, with load time dropping by about 40% (1.81 s to 1.09 s):

cv2.imwrite('output_high.jpg',a,[int(cv2.IMWRITE_JPEG_QUALITY), 80])


Baseline:

1.81 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With nvJPEG decoding on the GPU (the call takes device=device):

1.09 s ± 8.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
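The compressibility assumption is easy to see without a GPU. The sketch below uses zlib from the Python standard library as a stand-in. It is lossless and is not JPEG, so the numbers will not match those above (lossy JPEG can still shrink noise by discarding detail), but the contrast between random and structured data is the same idea. The "structured" buffer is an invented smooth ramp, not real image data.

```python
import os
import zlib

n = 1_000_000

# Random bytes: no redundancy for a lossless compressor to exploit.
random_bytes = os.urandom(n)

# Structured bytes: a slowly changing ramp, loosely like a smooth gradient.
structured = bytes((i // 4000) % 256 for i in range(n))


def compressed_fraction(data):
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 6)) / len(data)


print(f"random:     {compressed_fraction(random_bytes):.3f} of original size")
print(f"structured: {compressed_fraction(structured):.3f} of original size")
```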

One can further decrease the compression quality, and the speedup will again improve, but it is important to note that this is lossy compression, and the image (especially a random image) very quickly becomes distorted to an unacceptable degree. Testing here for your specific use cases will be critical, as will understanding how compression artifacts may affect your DL models.

We have a comprehensive paper on that topic: Quantitative Assessment of the Effects of Compression on Deep Learning in Digital Pathology Image Analysis

While we often don’t load jpeg files directly from disk, it does make me wonder if there are storage mechanisms (e.g., LMDB) that can be employed which allow for storage on disk and decompression on the GPU device, in a similar fashion as this.

One issue remains, though: in most common deep learning approaches, one would also like to perform augmentation of the images (rotation, color, etc.), which is not as easily replicated on the GPU and thus is currently still limited to the CPU. That said, there is a growing trend to do more of this preprocessing on the GPU, further motivating the transfer of compressed data to GPU memory.

Note that while the nvJPEG library exists as a separate entity, it was only integrated into PyTorch as of version 1.9.1.

A working jupytext example of this code is available here.

Thank you to Prof Lee Cooper for inspiring this post!
