As part of the scaling up of our QuickAnnotator tool, we executed a series of tests to benchmark backend technology. In particular, we were interested in looking at storage mechanisms for polygons which not only allow for their storage, but most importantly their spatial query. This implies that we could push geometries into a database, and then as part of a query, submit a second polygon (or bounding box) to identify those which intersect (among other spatial operations). The number of objects we were aiming for was at least 1 million rows in the database, as this is on the order of the number of unique cells within a typical whole slide image.
We also wanted to consider different scalability options. It should not be surprising that extremely high throughput typically requires more than one machine working in concert, which comes at the cost of additional setup and support complexity. It is not clear to us at the moment how much modern hardware has eliminated this cost, i.e., is 1 million “a lot” or “a little” with current technology? Regardless, one interesting way of managing complexity is through an abstraction layer, such that the backend can be readily and easily changed without having an impact (or as small an impact as possible) on the code used to interact with that database.
1 What is SQLAlchemy?
For this abstraction purpose, we have opted to use SQLAlchemy, which describes itself as “the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.” Interestingly, it supports a number of different backends (e.g., SQLite, PostgreSQL, Oracle, Microsoft SQL Server, MySQL, MariaDB), providing potential flexibility for switching between them without having to modify code.
What is perhaps less known is that using GeoAlchemy 2 with SQLAlchemy adds support for spatial databases. This naturally makes them a suitable stack for our particular use case, and tutorials on how they can generally be used together are widely available (e.g., from the GeoAlchemy 2 site itself).
Another benefit of SQLAlchemy is that most, if not all, of our queries can be written as Python code operating on objects, as opposed to writing SQL commands. SQLAlchemy ultimately takes the Python code you’ve written and converts it to the associated (ideally optimized) SQL code for execution. I can tell you from personal experience that, in many situations, writing Python code is far less error prone and quicker to write than direct SQL, so there are secondary benefits to taking such an approach beyond just abstraction.
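As a tiny illustration of what we mean (a hypothetical sketch, not code from QuickAnnotator itself – the Annotation model below is invented purely for demonstration), the filter is plain Python, and SQLAlchemy renders the corresponding SQL for whichever backend the engine points at:
- # Hypothetical model, for illustration only
- from sqlalchemy import Column, Integer, String, create_engine
- from sqlalchemy.orm import declarative_base, sessionmaker
- Base = declarative_base()
- class Annotation(Base):
-     __tablename__ = 'annotations'
-     id = Column(Integer, primary_key=True)
-     name = Column(String)
- engine = create_engine('sqlite://', echo=True)  # echo=True prints the SQL SQLAlchemy emits
- Base.metadata.create_all(engine)
- session = sessionmaker(bind=engine)()
- session.add(Annotation(name='tubule'))
- session.commit()
- # Rendered roughly as: SELECT ... FROM annotations WHERE annotations.name = ?
- tubules = session.query(Annotation).filter_by(name='tubule').all()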
We have had great experience with it: the previous version of QuickAnnotator, as well as PatchSorter, use SQLAlchemy for a large portion of their database interactions. There is sometimes a performance penalty to pay, as retrieving “rows” is faster than retrieving “objects” (which internally consists of retrieving the rows and then casting them to Python objects); exactly how large that penalty is, is one of the questions we hope to investigate below.
2 Database Considerations
So what types of databases should we investigate for our particular use case? After looking at what was available, we found that extended versions of very popular databases exist which provide the spatial functionality we need. In particular, we will look at SpatiaLite (SQLite + spatial), PostGIS (PostgreSQL + spatial), and then, for extreme scaling, Citus (PostgreSQL + spatial + distributed) or CockroachDB. The latter two require a more complex setup, and as such we will tackle them in a second blog post, depending on how the results here on SpatiaLite and PostGIS play out.
Here we briefly introduce those databases and how we installed them:
2.1 Spatialite
From their website: “SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.”
SpatiaLite builds on SQLite and is an open-source library intended to extend the SQLite core to support fully fledged Spatial SQL capabilities.
2.1.1 Setup
Amazingly, in a linux environment, it is often trivial to install:
- apt-get install -y libsqlite3-mod-spatialite sqlite3
and then once running sqlite3, simply load the extension with this command:
- .load 'mod_spatialite'
This must be done each time a new connection is made (we’ll see this in the code below), but the time cost is negligible – essentially ignorable.
2.2 PostGIS
PostgreSQL is a “powerful, open-source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.”
Similar to SpatiaLite, there is an extension version called PostGIS, which “extends the capabilities of the PostgreSQL relational database by adding support for storing, indexing, and querying geospatial data.”
2.2.1 Setup
For the setup, we rely on docker, and simply spin up an instance:
- docker run --name postgis -e POSTGRES_HOST_AUTH_METHOD=trust -d postgis/postgis
Then, after creating a database, we simply need to enable the PostGIS extension via execution of a SQL query (as we will see in code below):
- CREATE EXTENSION IF NOT EXISTS postgis;
3 Benchmarking
The backend needs for QuickAnnotator appear to be rather straightforward based on our previous version of the tool and our evolving software design document. In particular, we will aim to generally test:
- Insertion speeds
- Query speeds, by e.g., index – needed to pull out individual rows for DL training
- Spatial query speeds, by bounding-box – needed for frontend visualization
We aim to do so with both single-threaded and multi-process approaches, using the Ray framework we previously discussed. There are a few quirks worth considering, and these are discussed below in their respective sections.
Our expectation is that some users will want to use QuickAnnotator on a smaller machine, like their laptop, for more modest size datasets, while other users will want to use a larger machine (or even multiple machines), to support thousands of whole slide images.
As such, we aim to conduct benchmarking runs in both settings:
3.1 Laptop + Server Setups
For the purposes of these tests, we’re using (a) a ThinkPad X1 Extreme laptop, which has an 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz (turbo max 5GHz), 32GB of RAM, and an NVMe hard drive, and (b) a large blade server with 2x Intel(R) Xeon(R) Silver 4216 CPUs @ 2.10GHz (turbo max 3.20GHz), 256GB of RAM, as well as an NVMe hard disk.
3.2 Example Data
Similar to our previous blog post, to perform these tests we’re using a 310MB GeoJSON plain-text file, available compressed here, which contains annotations from our recent kidney tubule preprint. This particular collection contains 88,605 polygons from tubules, tubular basement membrane, lumen, nuclei, etc. (as described in the manuscript). For ease of testing, we can insert this file multiple times without any loss of generalizability. An example screenshot is shown here:

4 Brief Introduction to Spatial concepts using SQLAlchemy + SQL
We don’t intend to provide a full tutorial to SQLAlchemy, which is already well done on their site and available here. Instead, we provide a brief introduction to three different concepts.
Object-Relational Mapping (ORM): This is a technique that allows you to interact with your database using Python classes and objects rather than writing raw SQL queries. Simply define your classes, and the objects that are created will be persisted in the backend. This provides a powerful and intuitive way to model and query spatial data; however, it does tend to come with additional query overhead (e.g., mapping SQL results back to Python objects can be expensive). If we need to, we can avoid using the ORM, since (a) our chosen DBs all support the same SQL syntax, so we can switch between them without having to modify our SQL statements, and (b) our statements are fairly straightforward, and rarely (never?) contain, e.g., joins necessitating complex query planning. The question we hope to answer here is what the trade-off is between code complexity and time efficiency. Regardless, the relational mapping is especially useful for creating our tables and indexes!
SQLAlchemy Core: SQLAlchemy’s Core is a more granular and flexible way to interact with the database. It provides a more explicit, SQL-centric approach that can be particularly useful when working with complex spatial queries or when bleeding-edge efficiency is required.
ST_ Spatial Functions: These are specialized SQL functions for handling spatial data, such as calculating distances, checking for intersections, and transforming coordinates. We’ll look at how these functions can be integrated into SQLAlchemy queries to perform the spatial analyses we need. Some of these have significant time implications – see our benchmarks below!
Note that in the SQL statements there are sometimes *small* differences between these databases’ function names; one should consult the associated documentation (SpatiaLite, PostGIS) for the correct naming. Generally, in PostGIS one needs to add “ST_” before the function, while in SpatiaLite one often need not. We intend to see whether SQLAlchemy successfully abstracts away these small nuanced differences, thus giving greater flexibility when transitioning between backends.
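As a sketch of what that abstraction can look like (using the GeometryModel class and session defined later in this post; whether the ST_-prefixed names resolve identically on both backends is exactly what we want to verify), GeoAlchemy 2 lets spatial functions be written as Python expressions rather than hard-coded SQL strings:
- from sqlalchemy import func
- # GeoAlchemy 2 exposes spatial functions both via sqlalchemy.func and as
- # methods on Geometry columns; the dialect renders them to SQL for us.
- bbox = func.ST_MakeEnvelope(0, 0, 1000, 1000)  # PostGIS-style; SpatiaLite's equivalent is BuildMbr
- query = session.query(GeometryModel).filter(GeometryModel.geom.ST_Intersects(bbox))
- hits = query.all()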
ST_ Functions that we’ve used below which may not be obvious to the typical SQL user:
GeomFromGeoJSON (SpatiaLite) / ST_GeomFromGeoJSON (PostGIS) – these accept GeoJSON strings and convert them into “geometry” objects (the base object used in spatial databases), which can have geometric operations performed on them, or be stored. An example of its usage is shown below, which pushes a GeoJSON string from our GeoJSON file into the database:
- res=conn.execute(text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)
where polygons is a list of dictionaries with the associated keys:
- polygons.append({'name':name,'geom':geometry})
ST_Centroid – takes a geom and computes its centroid, as a geom, which can either be processed further or subsequently converted to a more favorable format, e.g., geojson
AsGeoJSON (SpatiaLite) / ST_AsGeoJSON (PostGIS) – converts a GEOM object to a GeoJSON string – this is quite interesting, as this string can then be, e.g., loaded directly into shapely, or returned to a front end which uses GeoJS to plot geoms!
Those two functions together, like this, would return a list of all the centroids in geojson format computed dynamically:
- res=conn.execute(text("select AsGeoJSON(ST_Centroid(geom)) as centroid from geometries"))
ST_Intersects – This function checks if two geometries intersect, returning a boolean value (true or false). It’s particularly useful for spatial queries where we need to filter down results based on whether or not geometries overlap. For example, we could query a database to find all geometries that intersect with a given polygon, making it easy to identify areas that overlap with a particular region. A query like this would return all records that intersect with a specified geometry:
- res = conn.execute(text("select * from geometries where ST_Intersects(geom, ST_MakeEnvelope(-10, -10, 10, 10, 4326))"))
BuildMBR (SpatiaLite), ST_MakeEnvelope (PostGIS) – This function creates a rectangular geometry defined by minimum and maximum X and Y values. It’s ideal for quickly defining a bounding box for spatial queries, especially when you want to filter data by coordinates. For example, you could use ST_MakeEnvelope to define a bounding box that only returns geometries within a certain area. Combining it with ST_Intersects, you could get all geometries within a bounding box like this:
- res = conn.execute(text("select * from geometries where ST_Intersects(geom, ST_MakeEnvelope(-180, -90, 180, 90, 4326))"))
AsEWKB (SpatiaLite), ST_AsEWKB (PostGIS) – Converts a geometry object into Extended Well-Known Binary (EWKB) format. This format extends WKB by including SRID information, making it highly portable for spatial databases and applications that require SRID metadata. It’s ideal for transferring geometries with spatial reference context across different systems. For example, you could retrieve a geometry in EWKB format and send it to another database, maintaining both its shape and SRID:
- res = conn.execute(text("select AsEWKB(geom) as ewkb_geom from geometries"))
5 Shared Code
For simplicity, here we note the code that is shared across all the use cases.
5.1 SQLite
Below is how we create a SQLite engine and provide an event listener so that, when a new connection is created, the SpatiaLite extension is loaded:
- # Connect to the Spatialite database
- db_path = 'sqlite:///spatialite.db'
- engine = create_engine(db_path, echo=False) # Set echo=True to see SQL commands being executed
- # Initialize Spatialite extension
- @event.listens_for(engine, "connect")
- def connect(dbapi_connection, connection_record):
-     dbapi_connection.enable_load_extension(True)
-     dbapi_connection.execute('SELECT load_extension("mod_spatialite")')
Note that SQLite has a number of PRAGMAs which can improve performance. We started by using the following (after a conversation with ChatGPT), but in practice, at least for the insertion use case of 1 million records, a mere 7-second time difference was noted, and thus we opted to proceed without them for all tests.
It seems that the majority of the computation time is in fact in the spatial components of the queries and not in reading/writing to the database. These PRAGMAs can of course be tuned and would be a subject of their own blog post, but are outside of the scope here – our main question is whether the DB can overall provide the performance we need without significant tweaking. Regardless, if you’re doing serious optimization, you may want to look into them:
- dbapi_connection.execute("PRAGMA synchronous = OFF;")
- dbapi_connection.execute("PRAGMA journal_mode = OFF;")
- dbapi_connection.execute("PRAGMA locking_mode = EXCLUSIVE;")
- dbapi_connection.execute("PRAGMA temp_store = MEMORY;")
- dbapi_connection.execute("PRAGMA cache_size = -64000;")
- dbapi_connection.execute("PRAGMA count_changes = OFF;")
- dbapi_connection.execute("PRAGMA foreign_keys = OFF;")
- dbapi_connection.execute("PRAGMA mmap_size = 268435456;") # Example size (256 MB)
5.2 PostGIS
Conceptually similar to SQLite: we create the database engine and then a listener to ensure that the PostGIS extension is present. This likely only needs to be done once, but the “IF NOT EXISTS” always ensures that our DB is correctly set up, with minimal time consequence:
- engine = create_engine('postgresql://postgres@localhost:5432/test1')
- # Initialize the PostGIS extension
- @event.listens_for(engine, "connect")
- def connect(dbapi_connection, connection_record):
-     with dbapi_connection.cursor() as cursor:
-         cursor.execute('CREATE EXTENSION IF NOT EXISTS postgis;')
5.3 SQLAlchemy
Beyond the necessary imports (please look directly at the code samples, provided on GitHub here), we create a declarative base from which our classes inherit:
- # Create a base class for our declarative mapping
- Base = declarative_base()
Then we define our class, which maps to a database table named “geometries”, with the associated columns defined by class attributes. Importantly, the geom column is defined as a “Geometry” of type “POLYGON”, which means that only polygons can be stored in it. There are other supported types, or a generic type can be used, but these are outside the scope of this post:
- # Define your SQLAlchemy model
- class GeometryModel(Base):
-     __tablename__ = 'geometries'
-     id = Column(Integer, primary_key=True)
-     name = Column(String)
-     geom = Column(Geometry('POLYGON'))
When you’re done creating all the associated classes in “Base”, proceed to actually execute their creation:
- # Create the table
- Base.metadata.create_all(engine)
And off you go! You can create a session like so, which interestingly is abstracted from the actual backend database!
- # Start a session
- from sqlalchemy.orm import sessionmaker
- Session = sessionmaker(bind=engine)
- session = Session()
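To give a flavor of what working with the session looks like, here is a minimal sketch of persisting one polygon and reading it back (GeoAlchemy 2 also accepts plain WKT strings for geometry columns, though SRID handling can differ between backends):
- # Persist one polygon via the ORM and read it back (illustrative values)
- poly = GeometryModel(name='tubule', geom='POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))')
- session.add(poly)
- session.commit()
- first = session.query(GeometryModel).filter_by(name='tubule').first()
- print(first.id, first.name)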
6 General Conversions
Our data is a long list of dictionaries, each containing a key called “geometry” which holds our polygon represented as GeoJSON. As such, to begin, we would like to know how long it takes to take 1 million of these GeoJSON objects and convert them to (a) shapely objects, and then (b) extract the Well-Known Text (WKT) representation. Ultimately this forms a minimum baseline, as we cannot expect, e.g., insertion to take less time than actually loading and prepping the data for insertion.
As we will see below, many of these spatial databases “speak” both WKT and GeoJSON, though internally they have their own proprietary formats. Our general expectation is that WKT will be more efficient in terms of speed and space, since it is a denser format that is easier to parse, but we can see what the resulting tests show.
We perform three basic tests:
- %%time
- for _ in range(12):
-     for geojson in tqdm(data):
-         name = geojson["properties"]["classification"]["name"]
-         geometry = orjson.dumps(geojson["geometry"]).decode('ascii')
and
- %%time
- for _ in range(12):
-     for geojson in tqdm(data):
-         name = geojson["properties"]["classification"]["name"]
-         shapely_geom = shape(geojson["geometry"])
and
- %%time
- for _ in range(12):
-     for geojson in tqdm(data):
-         name = geojson["properties"]["classification"]["name"]
-         shapely_geom = shape(geojson["geometry"])
-         wkt=shapely_geom.wkt
These convert our GeoJSON to a string, a shapely object, and WKT, respectively. We run them both single-core and multi-core (using Ray) to produce this table:
| | Server | Laptop |
| --- | --- | --- |
| Convert from GeoJSON to String (orjson) | 11s | 4.71s |
| Convert from GeoJSON to Shapely | 3min 14s | 1min 44s |
| Convert from GeoJSON to Shapely to WKT / WKB | 4min 19s | 2min 23s |
| Convert from EWKB to Shapely | 11.9s | 6.29s |
| Multi-proc – Convert from GeoJSON to Shapely to WKT | 55s | 41.8s |
These results demonstrate that this conversion is non-trivial, taking multiple minutes. This is still a pretty amazing number, as it suggests that the machines are able to process ~4,000 polygons per second on a single core. Not bad! When using Ray across multiple cores, we can see how this changes significantly, reaching a whopping ~18,000 polygons per second! This suggests (unsurprisingly) that when scaling to a large number of objects, intelligently using the available compute power will be critical. Interestingly, when we look at the database conversion timings above and the insertion timings below, we see that a majority of the time spent dealing with 1 million records is actually spent transforming them into formats which are accepted by the databases.
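For reference, the multi-process conversion follows the usual Ray pattern of batching the work into remote tasks. A rough sketch (the batch size and the lack of any result post-processing here are illustrative, not our exact benchmark code):
- import ray
- from shapely.geometry import shape
- ray.init()
- @ray.remote
- def convert_batch(batch):
-     # geojson dict -> shapely -> WKT, for one batch of features
-     return [shape(feat["geometry"]).wkt for feat in batch]
- batch_size = 5000
- batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
- wkts = ray.get([convert_batch.remote(b) for b in batches])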
Notably, even though the databases accept WKT and GeoJSON, each has its own internal format (different between SpatiaLite and PostGIS). This is important because it enables, e.g., better indexing, storage efficiency, and search, but there is a cost to getting data into that format which doesn’t appear to be avoidable.
7 Insertions
For both SpatiaLite and PostGIS we then check insertion speeds for (i) 1 million polygons and (ii) 1 million polygons with centroid calculation, using both (a) the ORM and (b) the Core. This was done first in a single thread:
| | Server: SpatiaLite | Server: PostGIS | Laptop: SpatiaLite | Laptop: PostGIS |
| --- | --- | --- | --- | --- |
| No Insert – Convert String in DB using GeomFromGeoJSON | 7min 7s | 13min 21s | 4min 49s | 15min 40s |
| Insert Geom-Only – ORM (Shapely) | 10min | 9min 19s | 7min 30s | 6min 38s |
| Insert Geom-Only – Core (Orjson) | 6min 42s | 11min 36s | 4min 57s | 14min 20s |
| Insert Geom + Centroid – ORM (Shapely) | 10min 44s | 9min 49s | 9min 18s | 7min 21s |
| Insert Geom + Centroid – Core (Orjson) | 12min 50s | 13min 42s | 10min 19s | 18min 54s |
| Insert Geom + Centroid + Optimized Core (Orjson) | 7min 48s | 15min 34s | 6min 4s | 19min 55s |
… and using a Ray cluster, sending a batch of 5k polygons:
| | Server: SpatiaLite | Server: PostGIS | Laptop: SpatiaLite | Laptop: PostGIS |
| --- | --- | --- | --- | --- |
| Insert Geom-Only – ORM (Shapely) | 6min 19s | 1min 14s | 5min 36s | 2min 28s |
| Insert Geom-Only – Core (Orjson) | | 1min 22s | | 2min 12s |
| Insert Geom + Centroid – ORM (Shapely) | | 1min 21s | | 2min 27s |
| Insert Geom + Centroid – Optimized Core (Orjson) | | 1min 34s | | 3min 21s |
We can easily compare the ORM and the Core in terms of code complexity, noting that there are subtle differences between the PostGIS and SpatiaLite versions when using the Core, but not when using the ORM. As well, having to write (and, more importantly, figure out) the exact insertion syntax was non-trivial; e.g., in SpatiaLite the requirement to explicitly set the SRID is likely unknown to most lay users. On the other hand, we can see that using the Core we’re able to directly ingest GeoJSON via the associated “GeomFromGeoJSON” function, which is executed within the database.
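To make the comparison concrete, the ORM insertion path looks roughly like the sketch below (our actual benchmark code is in the linked repository; “batch” here stands in for a list of GeoJSON features). The conversion happens in application space via shapely, and the SRID is set explicitly so it matches the column definition:
- from shapely.geometry import shape
- from geoalchemy2 import WKTElement
- objs = []
- for feat in batch:                              # batch: a list of geojson features
-     shapely_geom = shape(feat["geometry"])      # geojson -> shapely
-     objs.append(GeometryModel(
-         name=feat["properties"]["classification"]["name"],
-         geom=WKTElement(shapely_geom.wkt, srid=-1)))  # WKT + explicit SRID
- session.add_all(objs)
- session.commit()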
Another interesting point to note is that here we’re using a single “conn.execute” statement and providing it with a list of dictionaries to do the insertion. There are in fact two ways this can happen internally. The naïve way would execute one insert statement per item in the list; the more optimized way would use a batched insertion, where all the items in the list are inserted with a single statement. There are some good notes about this in the documentation:
Behind the scenes, the Connection object uses a DBAPI feature known as cursor.executemany(). This method performs the equivalent operation of invoking the given SQL statement against each parameter set individually. The DBAPI may optimize this operation in a variety of ways, by using prepared statements, or by concatenating the parameter sets into a single SQL statement in some cases. Some SQLAlchemy dialects may also use alternate APIs for this case, such as the psycopg2 dialect for PostgreSQL which uses more performant APIs for this use case.
https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html#tutorial-multiple-parameters
Relatedly, and perhaps not relevant here, if we want to bulk insert rows using the Core, we can do so with a dedicated syntax. This is not applicable here, since we don’t want to insert the “geojson” version directly; we instead want to apply a function to it, converting it internally to a GEOM. We mention it in case there are other use cases which fit this paradigm, for which it would indeed be the optimal way:
- with engine.connect() as conn:
-     result = conn.execute(
-         insert(user_table),
-         [
-             {"name": "sandy", "fullname": "Sandy Cheeks"},
-             {"name": "patrick", "fullname": "Patrick Star"},
-         ],
-     )
-     conn.commit()
So the question of whether the operation was done the naïve way or the optimal way is essentially hidden from us; we would need to, e.g., turn on “echo=True” and verify the statements. At least with the versions I am using, the end result of this operation appears to be a single insert statement per item, which in theory should produce additional overhead (the ORM version I confirmed uses the more optimal batched inserts), and yet we see that the Core insertion time is lower. The numbers in the previous section suggest we’re paying more of a penalty converting the objects to shapely than letting the database ingest them as GeoJSON. On the other hand, we’re paying in code complexity and reduced portability between backends. These are the types of factors that become involved when deciding how to implement a software project. In this case, it looks like both approaches provide more than sufficient throughput for our use cases, so there is a natural preference for the easier, less error-prone code (i.e., the ORM).
If we compare the four “Insert Geom-Only” experiments, we see something quite interesting. The Core (Orjson) experiment in PostGIS took nearly double the compute time of the next slowest experiment (14min 20s vs. 7min 30s). This may seem a bit strange – why would the ORM path, with its batched inserts, be that much faster? We expect some improvement, but this much seems unreasonable. The important factor is the cost of “No Insert – Convert String in DB using GeomFromGeoJSON”, which is essentially on par with the insertion time for the Core. Ultimately, it looks like the PostGIS functions which perform this conversion are not as optimized as other libraries that are available. For example, when we use the ORM approach, we (a) first create a shapely object and then (b) extract its WKT in order to instantiate the Python object. As a result, we’re not using the built-in PostGIS functions and are seemingly instead using a much more optimized library provided by shapely. In theory, if one wanted to use the Core approach, one could again perform this conversion in the “application” space instead of in the “database” space, so that more efficient libraries can be used.
A few other notes: it appears the resulting SpatiaLite DB size is almost 2x larger than that of PostGIS (4GB versus 1.8GB); however, the insertion time for SpatiaLite is indeed much quicker than PostGIS in the single-threaded setting. When we transition to multi-process insertion, we can see that PostGIS is now significantly faster, mainly since the SQLite backend is designed to be multi-reader, single-writer, and thus all write operations are performed serially. Postgres, on the other hand, successfully manages multiple write operations at the same time, enabling greater throughput. Naturally this has a consequence on which technology stack we should choose.
7.1 Insertions with Centroid Computation
Importantly, if we look at the Core version’s centroid insertion code, we see it was written naively, with GeomFromGeoJSON likely being executed twice; this is potentially the most expensive part of the process, as the computation time nearly doubles, which is unlikely to result solely from computing the ST_Centroid function:
- res=conn.execute(text("INSERT INTO geometries (name,geom,centroid) VALUES (:name,GeomFromGeoJSON(:geom),ST_Centroid(GeomFromGeoJSON(:geom)));"),polygons)
Instead we can try this version, which essentially reuses the geom; we call this the “optimized” version, and it goes to show how important query design is to execution time. Note that we must explicitly set the SRID, otherwise we get an error:
- res=conn.execute(text("INSERT INTO geometries (name, geom, centroid) SELECT :name, geom, ST_Centroid(geom) FROM (SELECT SetSRID(GeomFromGeoJSON(:geom),-1) AS geom) AS subquery;"),polygons)
As a comparison, we can simply run a query to compute and return all the centroids, like so:
- %%time
- with engine.connect() as conn:
-     res=conn.execute(text("select AsGeoJSON(ST_Centroid(geom)) as centroid from geometries"))
-     centroids=res.fetchall()
For 1 million polygons, this takes ~18s in SpatiaLite and ~6s in PostGIS (see below), which raises the question of where/when this conversion should be performed. E.g., if not operating in parallel, it may make sense to insert the geom into the database, and then compute/save the centroid via an UPDATE.
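That UPDATE-based variant could be as simple as the following sketch (reusing the engine and text imports from above, and assuming the centroid column from our schema):
- # Backfill centroids after bulk insertion, in a single statement
- with engine.begin() as conn:   # begin() commits automatically on success
-     conn.execute(text("UPDATE geometries SET centroid = ST_Centroid(geom);"))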
We note that SpatiaLite insertion is essentially the same whether you use single- or multi-processing, except that we need to add a very long PRAGMA busy_timeout to prevent failures due to database locking. This is to be expected, as it only supports a single writing thread, so it becomes clear that if one needs massive insertion throughput, a proper database system is required.
Interestingly, when we add centroids to the 1 million rows, in all cases the database size increases only modestly, from 4GB to 4.1GB (SpatiaLite) or from 1.8GB to 1.9GB (PostGIS) – suggesting the bulk of our storage goes to polygon storage and row overhead. This is to be expected.
Lastly, we note that there still remains quite a bit of optimization that can be done, via PRAGMAs or configuration parameters, optimizing the number of DB workers, batch sizes, Ray workers, etc. Ultimately, however, looking at the maximum speeds we’re obtaining (e.g., ~3k/sec inserts with SpatiaLite and ~35k/sec with PostGIS), it seems that either choice will greatly exceed our minimum requirements.
8 Query All Records
Another component we wanted to examine was how quickly we could load 1 million records from the two different databases using both the Core approach and the ORM approach. A major point of this exercise was also to understand how the different return formats impacted retrieval time. We can see the resulting table below:
| | Server: SpatiaLite | Server: PostGIS | Laptop: SpatiaLite | Laptop: PostGIS |
| --- | --- | --- | --- | --- |
| Retrieve all rows (ORM: EWKB + Convert) | 11min 19s | 3min 54s | 8min 21s | 3min 52s |
| Retrieve all rows as GeoJSON (ORM) | 3min 14s | 16 s | 4min 10s | 30.8 s |
| Retrieve all polygons as GEOM (Core) | 5.11 s | 31.1 s | 10.3 s | 47.9 s |
| Retrieve all polygons as GeoJSON (Core) | 3min 8s | 11.8 s | 2min 39s | 31.6 s |
| Retrieve all polygons as WKB (Core) | 8min 6s | 31.4 s | 6min 15s | 57.2 s |
| Retrieve all polygons as WKT (Core) | 3min 4s | 9.76 s | 2min 36s | 19.7 s |
At a high level, it appears beneficial to almost always use PostGIS over SpatiaLite. This is likely due to PostGIS having a better internal caching mechanism, as well as being designed for higher-throughput settings. What is potentially most interesting is that the format in which the data was returned had a large impact on performance, be it WKB, WKT, or GeoJSON. What is perhaps surprising to me is that the GEOM retrieval appears to take longer than the GeoJSON retrieval, even though the GeoJSON retrieval requires some computational overhead to perform the conversion, while I would expect GEOM retrieval to simply return exactly what is in the database. One potential explanation is that the GEOM datatype requires more memory than GeoJSON (the latter being more information dense). A random spot check showed that the length of a GEOM is twice that of the corresponding GeoJSON, which adds up to significant additional memory allocation and transfer across 1 million rows.
While our fastest format appears to be WKT, there is a question of the usability of this format for other processes. Given the results above, it seems that keeping an internal format of WKT and converting to, e.g., shapely or GeoJSON when needed is likely to be faster overall than using GeoJSON everywhere. That said, the time difference appears to be minimal, < 10 seconds (over 1 million records) compared to GeoJSON, so both would seem essentially on par, and a format that is easier to use with other libraries (e.g., one that can be directly plotted in the front end without additional manipulation) is likely the way to go.
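Rehydrating a retrieved WKT (or WKB) value back into a shapely object when it is needed is a one-liner; the variable names here are illustrative:
- import shapely.wkt, shapely.wkb
- poly = shapely.wkt.loads(wkt_string)   # wkt_string: the text returned by the query
- poly = shapely.wkb.loads(wkb_bytes)    # or from raw WKB bytes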
Although retrieving all polygons as Python objects using the ORM appears to be quite time consuming, it is very nice to see that retrieving the polygons as GeoJSON is quite fast, nearly on par with using SQL statements directly, again supporting the notion that an ORM approach is likely to be the less error-prone, faster-to-develop way forward.
9 Dynamic Computation
As a quick aside, with 1 million rows in the database, it became interesting to understand what the computational cost of computing each of the centroids dynamically would be. We can see the associated results in this table:
| | Server: SpatiaLite | Server: PostGIS | Laptop: SpatiaLite | Laptop: PostGIS |
| --- | --- | --- | --- | --- |
| Compute dynamically centroids of all polygons WKB (ORM) | 26 s | 15 s | 27.2 s | 13.3 s |
| Compute dynamically centroids of all polygons GeoJSON (ORM) | 20.2 s | 8.45 s | 16.9 s | 7.95 s |
| Compute dynamically centroids of all polygons (Core) | 19.5 s | 7.15 s | 19.4 s | 6.86 s |
| Compute dynamically centroids of all polygons GeoJSON (Core) | 19.2 s | 6.83 s | 18.5 s | 7.19 s |
| Compute dynamically centroids of all polygons EWKT (Core) | 19.6 s | 7.26 s | 17.2 s | 6.82 s |
| Compute dynamically centroids of all polygons EWKB (Core) | 21.6 s | 9.5 s | 18.8 s | 9.14 s |
Mainly what we see is that PostGIS dynamically computes centroids faster than the equivalent operation in SpatiaLite. Again, our ORM approach appears to be essentially on par with the Core approach, supporting its utility. There don’t appear to be any major differences in terms of conversion or retrieval time, which is interesting because that wasn’t the case in our benchmarks returning polygons. In this case we are returning simply a POINT geom type, which is quite small in terms of bytes, suggesting that the major overhead in the previous use cases was in fact the conversion of polygons to other formats. Here, the polygon is used internally, a centroid is calculated (apparently a fast operation), and then this x,y coordinate is converted into the desired format. The algorithmic complexity of this operation is similarly low, so the conversion and retrieval costs appear to be mostly minimized. It’s interesting to note that when the dynamic centroid computation completes in 7 seconds, we’re essentially computing and returning the centroids of ~143,000 polygons *per second* – a truly impressive throughput!
10 Query by ID Single Thread
The next question, of course, is how long it takes to get specific data out. For this we randomly select an ID and return the row to the client. We have two approaches here: one where we ask the server to return the GEOM as GeoJSON, and the other which returns it directly. The internal format is indeed WKB, which can additionally be converted to a shapely object like so, if needed:
- random_row = conn.execute(
- text(f'''
- SELECT id,geom
- FROM geometries
- WHERE id = {random_index}
- ''')
- ).fetchone()
- obj=shapely.wkb.loads(random_row[1])
Our results are showcased here:
| | Server: SpatiaLite | Server: PostGIS | Laptop: SpatiaLite | Laptop: PostGIS |
| --- | --- | --- | --- | --- |
| Query id single process ORM ID only | 971 μs ± 26.3 μs | 2.68 ms ± 351 μs | 782 µs ± 45.8 µs | 2.54 ms ± 283 µs |
| Query id single process ORM + Shapely Convert | 1.16 ms ± 30.4 μs | 2.49 ms ± 99.2 μs | 917 µs ± 41.4 µs | 2.41 ms ± 203 µs |
| Query id single process ORM + Shapely Area | 1.15 ms ± 33.8 μs | 2.52 ms ± 129 μs | 876 µs ± 35.6 µs | 2.55 ms ± 71.1 µs |
| Query id single process WKB + Shapely Convert | 869 μs ± 14.2 μs | 1.92 ms ± 60.9 μs | 631 µs ± 51.2 µs | 2.09 ms ± 89 µs |
| Query id single process WKB + Shapely Area | 880 μs ± 27.5 μs | 2.02 ms ± 106 μs | 619 µs ± 26.3 µs | 2.11 ms ± 86.4 µs |
| Query id single process GeoJson | 382 μs ± 7.91 μs | 1.31 ms ± 30.6 μs | 301 µs ± 22 µs | 1.84 ms ± 48 µs |
| Query id single process GeoJson + Shapely Area | 630 μs ± 12.8 μs | 2.06 ms ± 98 μs | appro | 2.2 ms ± 218 µs |
| Query id single process GeoJson + Dynamic Centroid | 421 μs ± 9.39 μs | 1.5 ms ± 63.5 μs | 346 µs ± 20.3 µs | 1.84 ms ± 59.9 µs |
We can see our databases are amazing at their job: give them an ID and they will get you your data in a few milliseconds or less – either technology is certainly usable! Interestingly, however, we see that SpatiaLite has a small edge over PostGIS – rather surprising! I suspect, however, that we’ll see this disappear when we transition to a parallel-processing approach. Despite the previously observed time cost of converting GEOMs to GeoJSON, “Query id single process GeoJson” yielded the fastest times here across the board, perhaps demonstrating that the effect of SQLAlchemy ORM overhead becomes more pronounced in these single-item retrieval tasks. Even though the ORM is slower than the Core at querying a single item by ID, both approaches are still lightning quick.
11 Query by ID Multi Thread
We use Ray to perform 10,000 queries, each reading a single random row. We begin with a baseline measurement which doesn’t use any database components at all, and instead demonstrates the minimum compute time needed for Ray to spin up and launch tasks in bulk. We measure both 100 tasks and 10,000 tasks – ultimately it would be impossible (?) to ever be faster than this benchmark, so we can use subsequent measurements as a “delta” to understand the additional computational time needed to perform the querying.
Once again, our results:
| | Laptop | Server |
| --- | --- | --- |
| Query id multi process 100 tasks – baseline | 244ms | 70ms |
| Query id multi process 10,000 tasks – baseline | 1.64s | 2.84s |

| | Laptop: SpatiaLite | Laptop: PostGIS | Server: SpatiaLite | Server: PostGIS |
| --- | --- | --- | --- | --- |
| Query id multi process 10,000 queries ORM ID only – Bulk (128 tasks, 80 queries) | 536ms | 1.66 s | 2.32s | 2.48s |
| Query id multi process 10,000 queries ORM ID only | 10s | 20.9s | 42.6s | 44.8s |
| Query id multi process 10,000 queries ORM + Shapely Convert | 8.92s | 20.9s | 42s | 44.9s |
| Query id multi process 10,000 queries ORM + Shapely Area | 9.05s | 21.5s | 41.7s | 45.4s |
| Query id multi process 10,000 queries WKB + Shapely Convert | 8.08s | 21.5s | 38.4s | 41.3s |
| Query id multi process 10,000 queries WKB + Shapely Area | 8.12s | 20.5s | 37.4s | 40.8s |
| Query id multi process 10,000 queries GeoJson | 8.03s | 21.5s | 37.9s | 40.4s |
| Query id multi process 10,000 queries GeoJson + Shapely Area | 7.97s | 20.2s | 37.7s | 39.9s |
| Query id multi process 10,000 queries GeoJson + Dynamic Centroid | 8.02s | 20.2s | 36.7s | 40.2s |
What we see is that for 100 tasks, the laptop and server complete in 244ms and 70ms, respectively, while for 10,000 tasks we see 1.64s versus 2.84s. Interestingly, the “winner” swaps between them, likely because launching 100 tasks is more bottlenecked by CPU speed (where the laptop’s boost clock is faster), while for 10,000 tasks it becomes more advantageous to have many more CPUs grinding away at them.
We then look at the bulk versus individual task options. In the bulk version, we spin up 128 tasks and ask each of them to retrieve 80 rows by randomly selected IDs. We can see that the compute time is comparable to that of the baseline, implying that the totality of the compute time remains within the Ray task processing and not necessarily the database components.
We then issue 10,000 individual tasks and retrieve the data in different formats, and immediately see that database overhead consumes almost all of the compute time, with little difference between the “types” of data returned (WKB, GeoJSON, dynamic centroid, etc.), suggesting that most of the time is actually spent setting up the database connections rather than in the Ray task overhead. This suggests that if one were to use this approach, they would likely want to consider database connection pools or other mechanisms (e.g., Ray actors) to reuse connections, so that performance more closely matches that of the bulk queries. Interestingly, if we divide 10,000 queries by, e.g., 40 seconds, we see that we’re doing 250 queries per second, or 4ms per query, which is still quite snappy. We can compare this number to the single-core direct approach, however, and note that it is ~4x slower to go through Ray, which again indicates the associated overhead of using Ray for multiprocessing.
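One hedged sketch of what such connection reuse might look like is a small pool of Ray actors, each holding its own engine (and therefore its own connection pool), with queries distributed across them. The worker count and query details below are illustrative, not something we benchmarked here:
- import ray
- from sqlalchemy import create_engine, text
- ray.init(ignore_reinit_error=True)
- @ray.remote
- class QueryWorker:
-     def __init__(self, url):
-         self.engine = create_engine(url)      # engine (and its pool) lives for the actor's lifetime
-     def get_by_id(self, row_id):
-         with self.engine.connect() as conn:
-             row = conn.execute(text("SELECT id, name FROM geometries WHERE id = :id"),
-                                {"id": row_id}).fetchone()
-             return tuple(row) if row else None
- n_workers = 8
- workers = [QueryWorker.remote('postgresql://postgres@localhost:5432/test1') for _ in range(n_workers)]
- futures = [workers[i % n_workers].get_by_id.remote(i + 1) for i in range(10_000)]
- rows = ray.get(futures)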
12 Query by Spatial Centroid vs Intersection
Next we look at spatial queries, which are one of the main reasons we’re using a spatial database. We proceed as follows: we first (a) randomly select a polygon from the database, (b) compute its centroid, (c) use that centroid to create a bounding box, (d) query the database for any polygons intersecting the bounding box, and lastly (e) return different items such as “counts”, “ids”, or “geojson polygons”. We additionally compare query processing using either (a) the stored polygon itself (which asks: does this polygon intersect the bounding box?) or (b) the stored centroid which was precomputed during the insertion process above. We also compare both Core and ORM approaches to ensure we’re not losing too much performance if we opt to go the more sustainable ORM route.
We modulate the size of the bounding box to control how many rows are returned, to determine scalability. Due to the probabilistic nature of the randomly selected row, this is more of an approximate measure than a fixed quantity, but it should still give us an order-of-magnitude approximation of the associated times to see if anything is really out of whack. I used bounding box sizes of 100, 350, 1200, and 6000 to target approximate returned row counts of 100, 1,000, 10,000, and 100,000, respectively.
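For reference, a PostGIS-flavored sketch of steps (a) through (e) looks roughly like this (reusing the engine and text imports from above; the ID range and bounding-box size are illustrative):
- import random
- half_bbox_size = 350                          # targets roughly 1,000 returned rows in our data
- with engine.connect() as conn:
-     random_index = random.randint(1, 1_000_000)                         # (a) pick a stored polygon
-     cx, cy = conn.execute(
-         text("SELECT ST_X(ST_Centroid(geom)), ST_Y(ST_Centroid(geom)) "
-              "FROM geometries WHERE id = :id"), {"id": random_index}).fetchone()   # (b) its centroid
-     count = conn.execute(
-         text("SELECT count(*) FROM geometries "
-              "WHERE ST_Intersects(geom, ST_MakeEnvelope(:x0, :y0, :x1, :y1))"),    # (c)+(d) bbox intersection
-         {"x0": cx - half_bbox_size, "y0": cy - half_bbox_size,
-          "x1": cx + half_bbox_size, "y1": cy + half_bbox_size}).scalar()           # (e) return just a count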
The resulting tables are too large to fit inline, but are included in our full set of benchmark results here. At a high level, we can generally see that for PostGIS the Core and ORM versions perform relatively on par, in the sense that the computation times are pretty darn quick. Counting only 100 rows takes 9.7ms versus 19ms (Core vs. ORM), but that gap shrinks when looking at, e.g., 1,000 rows (33ms vs. 36ms), suggesting there is some randomness in the result-set length. A more rigorous benchmark would have, e.g., set random seeds to ensure the exact same queries are performed, but this was out of scope for our work. Regardless, as we look at these times, none of them pop out as a deal breaker for our use case, so such strictness was not necessary here. Even when looking at full intersections with the polygons themselves, we see reasonable times of ~70ms for 1,000 rows, which is still quite a lot of items for a very short amount of time. A major point to note is that, again, the return format is a strong driver of overall computation time; i.e., if you really need GeoJSON output versus just the ID numbers of the rows, you’re going to pay a premium for it. This should be considered in the context of the experiments above looking at the different return types – from there one can begin to figure out which formats are faster and, if usable within your pipeline, potentially a more apt way forward.
13 Optimizing SpatiaLite’s Spatial Query
One important note, however, is that almost all of the SpatiaLite queries appear to take the same amount of time. This seemed very strange, and I spent quite a lot of time digging in to debug it. One would definitely expect to see some type of monotonically increasing compute time as the number of returned rows goes up by an order of magnitude. At the same time, we can see that the typical compute time is on par with, e.g., PostGIS returning 100,000 rows, regardless of the number of rows returned, suggesting that SpatiaLite is doing *a lot* more work for 100 rows than it should be.
After quite some time, I discovered that there is an issue with the way I was using the spatial index, and posted an associated GitHub issue here: https://github.com/geoalchemy/geoalchemy2/issues/520
To understand what was going on, I set the engine’s echo = True and saw the query:
- 2024-08-14 00:25:45,980 INFO sqlalchemy.engine.Engine BEGIN (implicit)
- 2024-08-14 00:25:45,984 INFO sqlalchemy.engine.Engine SELECT geometries.id AS geometries_id, geometries.name AS geometries_name, AsEWKB(geometries.geom) AS geometries_geom
- FROM geometries
- WHERE ST_Intersects(geometries.geom, BuildMbr(?, ?, ?, ?))
- 2024-08-14 00:25:45,986 INFO sqlalchemy.engine.Engine [no key 0.00186s] (29139.1654723127, 29864.74657980455, 31139.1654723127, 31864.74657980455)
I copied and pasted this into sqlite3 and saw that indeed a full table scan had taken place:
- sqlite> explain query plan SELECT geometries.id AS geometries_id, geometries.name AS geometries_name, AsEWKB(geometries.geom) AS geometries_geom
- FROM geometries
- WHERE ST_Intersects(geometries.geom, BuildMbr(29139.1654723127, 29864.74657980455, 31139.1654723127, 31864.74657980455));
- ;
- QUERY PLAN
- `--SCAN geometries
… even though it does indeed know that an index exists. So essentially SpatiaLite was always performing a full table scan instead of using the spatial index.
Digging deeper, I found these pages (1, 2, 3), which show the correct way to write the query:
- SELECT *
- FROM com2011 AS c, prov2011 AS p
- WHERE ST_CoveredBy(c.geometry, p.geometry) = 1
- AND nome_pro = 'AREZZO' AND c.ROWID IN (
- SELECT ROWID
- FROM SpatialIndex
- WHERE f_table_name = 'com2011'
- AND search_frame = p.geometry);
One piece missing, which will trip you up if you don’t realize it, is that if you *also* have multiple geometry columns, you need to specify the column (f_geometry_column), otherwise you won’t get any rows back.
In the end, this is what my query looked like:
- npolys = conn.execute(
- text(f'''
- SELECT count(geom)
- FROM geometries
- WHERE geometries.ROWID IN (
- SELECT ROWID
- FROM SpatialIndex
- WHERE f_table_name = 'geometries'
- AND f_geometry_column = 'geom'
- AND search_frame = BuildMBR(
- {centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
- {centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
- ))
- ''')
- ).fetchone()
And I redid the same queries as above. We can see from the resulting numbers that this did indeed have a huge impact on the compute time, bringing the 100-row query down from ~1 second to ~1ms – the power of indexes! I note that all of these queries use the Core; after spending quite a bit of time (and working with ChatGPT), I could not devise a way to perform the same query using the ORM, which isn’t exactly great. Even these queries, as you see above, are a bit complex and really require careful attention to get everything lined up correctly. The compute-time savings are of course worth it, but if you take that route, be sure to read all the associated documentation and check and double-check that your query is correct!
14 Conclusion
Looking at these results, none of them rule out using either SpatiaLite or PostGIS. Remembering that the database had over 1 million rows in it and was still able to serve queries quite quickly, there is no bad choice here. If we look at, e.g., PostGIS, it is going to provide an easier coding approach – one can use the ORM and the spatial index is used naturally – but we have to pay in terms of infrastructure setup time, although this is made much easier with, e.g., Docker. On the other hand, SpatiaLite is trivial to set up, but if you intend to do spatial queries, expect to spend more time debugging them to ensure that they’re working the way you think they are.
We present the full set of our benchmark results here, as well as all of the associated code used to generate these results here!
Best of luck spatially searching!