Serialization and storage of GeoJson in Digital Pathology

GeoJSON, a widely used format based on JSON (JavaScript Object Notation), is specifically designed for encoding a variety of geographic data structures. This versatile format excels at representing simple geographical features, such as points, lines, and polygons, along with their non-spatial attributes. In the realm of digital pathology, GeoJSON has emerged as a common format for storing annotations, enabling precise documentation of regions of interest, cellular structures, and other critical details within pathology images. The popularity of GeoJSON in this field is bolstered by its broad support across numerous tools (e.g., QuPath), which facilitates seamless integration and analysis in digital pathology workflows.
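
For readers less familiar with the format, each annotation is typically stored as a GeoJSON Feature: a geometry (e.g., a polygon defined by its vertex coordinates) plus a free-form properties dictionary holding the non-spatial attributes. As a minimal illustrative sketch (the class name, property layout, and coordinates below are hypothetical, not taken from our dataset), such a feature might look like this as a Python dictionary:

feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # one exterior ring, closed by repeating the first vertex
        "coordinates": [[[100, 200], [150, 200], [150, 250], [100, 250], [100, 200]]],
    },
    # non-spatial attributes, e.g., a QuPath-style classification
    "properties": {"classification": {"name": "tubule"}},
}

# an annotation file then holds many such features in a FeatureCollection
collection = {"type": "FeatureCollection", "features": [feature]}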

Despite its widespread adoption, there are several open questions regarding the efficient use of GeoJSON that can significantly impact performance. One key concern is the best method for storing GeoJSON in a compressed format to minimize storage requirements while preserving the integrity of the data. Efficient compression techniques are crucial, especially when dealing with large-scale pathology datasets.

Another important consideration is the speed of serializing GeoJSON objects to strings, a process necessary for tasks such as database insertion or file output. Rapid serialization ensures that digital pathology systems can handle large volumes of annotations without introducing significant delays, thereby maintaining smooth and responsive operations.

Addressing these performance-related challenges is essential for optimizing the use of GeoJSON in digital pathology. In particular, as we’re in the process of scaling up our Histotools Suite (i.e., HistoQC for quality control, QuickAnnotator for rapid segmentation, PatchSorter for rapid labeling, and CohortFinder for optimal data set discovery), we’re evaluating and selecting a combination of technologies that enables operating at digital pathology repository scale. We thought we’d share some of those tests and decisions with you to help guide your own decision-making process.

To perform these tests we’re using a 310MB geojson plain text file, available compressed here, which contains annotations from our recent kidney tubule preprint. This particular collection contains 88,605 polygons from tubules, tubular basement membrane, lumen, nuclei, etc. (as described in the manuscript). An example screenshot is shown here:

Below we’ll run through some common operations, such as loading, writing, and serializing, and look at the associated timings.


For the purposes of these tests, in the top part of this blog post we’re using a ThinkPad T460p laptop, which has an Intel i7-6700HQ processor and an NVMe hard drive. In the results section, we also include timings for a large server with 2x Intel(R) Xeon(R) Silver 4216 CPUs @ 2.10GHz, 256GB of RAM, and an NVMe hard disk.

1         Loading and Saving

1.1         Baseline

The most basic thing we may want to do is load the geojson, which we can do simply like so; this loads the file in 6.95 s ± 313 ms:

import json

with open('input.json', 'r') as file:
    data = json.load(file)

We can likewise write it back to a file like so, which takes 1min 39s ± 1.32 s:

with open("outfile.json", 'w') as outfile:
    json.dump(data, outfile)

1.2         Compressed

Since Gzip works on byte streams, we write the json compressed like so, which takes 3min 44s ± 2.37 s:

import gzip
with gzip.open("outfile.json.gz", 'wt', encoding="ascii") as zipfile:
    json.dump(data, zipfile)

and read it in like so in 7.25 s ± 105 ms:

import gzip
with gzip.GzipFile("outfile.json.gz", 'r') as f:
    data = json.loads(f.read())

Meanwhile, the geojson file was reduced from 310MB to 52MB, a reduction of roughly 83%. Not bad!

1.3         Ujson

There are other, non-standard json libraries that we can try; one of them is Ujson, which claims to be an “Ultra fast JSON decoder and encoder written in C with Python bindings”. It is easily installable with:

pip install ujson

We run through the same tests in terms of writing and reading both compressed and uncompressed geojson files:

1.3.1        Writing

With small modifications we can use Ujson for writing a compressed file, which here takes 2min 21s ± 2.63 s:

import gzip
import ujson
with gzip.open("outfile.json.gz", 'wt', encoding="ascii") as zipfile:
    zipfile.write(ujson.dumps(data))

Similarly, we can modify the above code to write an uncompressed file, which takes 4.94 s ± 50.6 ms:

import ujson
with open("outfile.json", 'w', encoding='utf-8') as regfile:
    regfile.write(ujson.dumps(data))

1.3.2        Reading

Reading-wise, we perform the same operations, first with compression, taking 5.12 s ± 69.8 ms:

import gzip
import ujson
with gzip.open("outfile.json.gz", 'rb') as zipfile:
    data = ujson.loads(zipfile.read().decode('utf-8'))

And then with an uncompressed file taking 3.89 s ± 77.4 ms:

import ujson
with open("outfile.json", 'r', encoding='utf-8') as regfile:
    data = ujson.loads(regfile.read())

1.4        Orjson

Another library we considered was Orjson (pip install orjson), which is described as a “Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy”.

1.4.1        Writing

Writing in compressed format takes 2min 17s ± 389 ms:

import gzip
import orjson
with gzip.open("outfile.json.gz", 'wb') as zipfile:
    zipfile.write(orjson.dumps(data))

However, if we’re willing to give up on compression we can write the uncompressed version in 1.89 s ± 22.7 ms:

import orjson
with open("outfile.json", 'wb') as regfile:
    regfile.write(orjson.dumps(data))

1.4.2        Reading

Similarly reading a compressed file takes 4.77 s ± 69.8 ms:

import gzip
import orjson
with gzip.open("outfile.json.gz", 'rb') as zipfile:
    data = orjson.loads(zipfile.read())

or an uncompressed file takes 3.51 s ± 118 ms:

import orjson
with open("outfile.json", 'rb') as regfile:
    data = orjson.loads(regfile.read())

1.5         MsgPack

Another option to consider is to save the JSON in a binary format, for example using Msgpack, which purports to be “like JSON, but fast and small”. This can be a viable option depending on your particular use case, but we should be aware that the above json files are “standard” in that they can simply be dragged + dropped into QuPath and QuPath will load them. Msgpack (and other binary json formats) are not recognized out of the box by QuPath, and so may require additional manipulation to get them loaded in successfully; a sketch of one such round trip is shown at the end of this section.

1.5.1        Writing

We can begin by writing an uncompressed version, which takes 7.46 s ± 1.31 s and yields a file of 140MB:

import msgpack
with open("outfile.msgpack", 'wb') as regfile:
    regfile.write(msgpack.packb(data, use_bin_type=True))

We can similarly write a gz compressed version, which takes 1min 9s ± 7.09 s and yields a file of 49MB:

import gzip
import msgpack
with gzip.open("outfile.msgpack.gz", 'wb') as zipfile:
    zipfile.write(msgpack.packb(data, use_bin_type=True))

1.5.2        Reading

When looking at the reading time, the uncompressed version needs 3.35 s ± 330 ms:

import msgpack
with open("outfile.msgpack", 'rb') as regfile:
    data = msgpack.unpackb(regfile.read(), raw=False)

with the compressed version taking 4.81 s ± 868 ms:

import gzip
import msgpack
with gzip.open("outfile.msgpack.gz", 'rb') as zipfile:
    data = msgpack.unpackb(zipfile.read(), raw=False)
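
As a concrete illustration of that extra manipulation, a round trip back to a QuPath-loadable file could look roughly like the sketch below (re-serializing with orjson from above; the file names are placeholders):

import msgpack
import orjson

# read the binary msgpack annotations back into a Python dictionary
with open("outfile.msgpack", 'rb') as f:
    data = msgpack.unpackb(f.read(), raw=False)

# re-serialize them as plain geojson so the file can again be dragged + dropped into QuPath
with open("outfile_for_qupath.json", 'wb') as f:
    f.write(orjson.dumps(data))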

1.6        Ujson + Snappy

It is also worth pointing out that we’re not constrained to the gzip compression format; we could instead use Snappy, which is a fast data compression and decompression library written in C++ by Google, based on ideas from LZ77 and open-sourced in 2011. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression.

Notably, however, while the compression speeds are fast, and the resulting file size is quite small, we again lose direct compatibility with e.g., QuPath, so a careful decision of when to employ different compression technologies is warranted.

1.6.1        Writing

The writing requires a bit of code rework, but completes in an impressive 7.18 s ± 632 ms while yielding a compressed file of 91MB:

import snappy
import ujson
compressed_data = snappy.compress(ujson.dumps(data).encode('utf-8'))
with open("outfile.json.snappy", 'wb') as snappyfile:
    snappyfile.write(compressed_data)

1.6.2        Reading

Reading also requires some rework, and yields a time of 5.41 s ± 466 ms:

import snappy
import ujson
with open("outfile.json.snappy", 'rb') as snappyfile:
    compressed_data = snappyfile.read()
    decompressed_data = snappy.decompress(compressed_data)
    data = ujson.loads(decompressed_data.decode('utf-8'))

2 Takeaways

This is the resulting table from our efforts, with timings for the laptop:

                   Read Uncompressed    Read Compressed      Write Uncompressed   Write Compressed
Native Json        6.95 s ± 313 ms      7.25 s ± 105 ms      1min 39s ± 1.32 s    3min 44s ± 2.37 s
Ujson              3.89 s ± 77.4 ms     5.12 s ± 69.8 ms     4.94 s ± 50.6 ms     2min 21s ± 2.63 s
Orjson             3.51 s ± 118 ms      4.77 s ± 69.8 ms     1.89 s ± 22.7 ms     2min 17s ± 389 ms
Msgpack            3.35 s ± 330 ms      4.81 s ± 868 ms      7.46 s ± 1.31 s      1min 9s ± 7.09 s
Ujson + Snappy     —                    5.41 s ± 466 ms      —                    7.18 s ± 632 ms

And for the server:

                   Read Uncompressed    Read Compressed      Write Uncompressed   Write Compressed
Native Json        5.53 s ± 12.5 ms     6.58 s ± 13.3 ms     51.3 s ± 57.6 ms     2min 55s ± 176 ms
Ujson              3.67 s ± 11.4 ms     4.64 s ± 9.41 ms     4.22 s ± 10.1 ms     1min 54s ± 87.9 ms
Orjson             3.37 s ± 9.66 ms     4.4 s ± 17 ms        1.19 s ± 23.3 ms     1min 50s ± 48.7 ms
Msgpack            2.6 s ± 8.24 ms      3.19 s ± 4.94 ms     3.48 s ± 19.3 ms     53.9 s ± 161 ms
Ujson + Snappy     —                    3.97 s ± 6.84 ms     —                    4.62 s ± 26 ms
Orjson + Snappy    —                    3.72 s ± 28.4 ms     —                    1.55 s ± 20.2 ms

We can see that considering other libraries beyond the native JSON library may be important, as each of them may provide specific advantages. As well, as expected, the difference in compute time is significant when moving from a laptop to a blade server!

From these tests, it appears that Orjson is either as performant as or faster than Ujson, with both libraries being faster in all categories than the native json library. Importantly, Orjson is able to write an uncompressed string about 2x-4x faster than Ujson. For many of our tools, we often have to serialize geojson dictionary objects to store them in a database (which accepts json), and as such orjson has actually reduced our import times for existing geojson objects into a database by at least half!
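
As a rough sketch of that database use case (using sqlite3 purely as a stand-in for whatever database actually receives the json, and assuming the loaded geojson is a FeatureCollection with a "features" list), the serialization step might look like:

import sqlite3
import orjson

# hypothetical table holding one serialized geojson feature per row
conn = sqlite3.connect("annotations.db")
conn.execute("CREATE TABLE IF NOT EXISTS annotations (id INTEGER PRIMARY KEY, geojson TEXT)")

# orjson.dumps returns bytes, so decode to str before inserting into a TEXT column
rows = [(orjson.dumps(feature).decode("utf-8"),) for feature in data["features"]]
conn.executemany("INSERT INTO annotations (geojson) VALUES (?)", rows)
conn.commit()
conn.close()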

Interestingly, since geojson itself is a standard, it is possible to, e.g., use one library for reading and another library for writing without a problem. This compatibility is somewhat lost if we start looking at compression techniques that are not currently industry standards.

For example, here we see that we can use Ujson + Snappy to get a significant speedup in terms of writing a smaller (~91MB vs ~310MB) compressed file, which is nearly on par in terms of speed with many of the uncompressed approaches. However, we lose drag + drop into QuPath functionality, implying that a secondary conversion process must take place before we can use the files in such tools. This is sometimes a requirement… and sometimes not. Either way, it should be on our radar for consideration when choosing a tech stack.

One thing to keep in mind as you benchmark different approaches is the potential impact of caching, which may make some tasks seem very slow and then suddenly very fast once the results are cached and returned from that cache instead of being recomputed.

There are a couple of ways to account for caching, particularly when the benchmark involves reading/writing files. One way is to always ensure that the cache is empty before, e.g., reading/writing to the disk. This can be done with tools like vmtouch. Similarly, vmtouch can be used to *fill* the cache, to avoid any cold-cache penalties.
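
For instance, assuming vmtouch is installed, evicting our test file from the page cache (or pre-loading it) before a run might look something like:

vmtouch -e outfile.json    # evict the file from the page cache
vmtouch -t outfile.json    # touch the file, pulling it into the page cache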

The option we’ve chosen here is instead to use “timeit”, which at minimum will run the same code 7 times, and reports the mean and standard deviation over all the runs. So while the first run may result in a slower run time due to a cold cache, (a) timeit recognizes this and provides a warning message that the first run was much slower than subsequent runs, and (b) we get some smoothing from the subsequent 6 runs to yield a more realistic real-world estimate.
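
Concretely, the timings in this post follow the mean ± standard deviation format that, for example, the %%timeit cell magic reports in a Jupyter/IPython notebook; a minimal sketch (assuming orjson has already been imported and outfile.json already exists) is:

%%timeit
# %%timeit defaults to 7 runs and reports the mean and standard deviation across them
with open("outfile.json", 'rb') as regfile:
    data = orjson.loads(regfile.read())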

That said, benchmarking can be very difficult, and one should attempt to mirror as closely as possible their expected environment and use case to obtain the most comparable results possible. E.g., if you expect to always be loading from an HDD instead of an SSD, or if you always expect to have the file already cached in memory, that knowledge should have implications for your experimental and benchmarking design.

In the end, it appears once again that matching the “right tool to the job” is critical, with trade-offs possible to gain specific speed or size characteristics.

Happy GeoJsoning!

Code for reproducing these tests is available here, and the same json we used is available here.
