Tagged with pyqgis

Speeding up your PyQGIS scripts

I’ve recently spent some time optimising the performance of various QGIS plugins and algorithms, and I’ve noticed that there’s a few common performance traps which developers fall into when fetching features from a vector layer. In this post I’m going to explore these traps, what makes them slow, and how to avoid them.

As a bit of background, features are fetched from a vector layer in QGIS using a QgsFeatureRequest object. Common use is something like this:

request = QgsFeatureRequest()
for feature in vector_layer.getFeatures(request):
    # do something

This code would iterate over all the features in layer. Filtering the features is done by tweaking the QgsFeatureRequest, such as:

request = QgsFeatureRequest().setFilterFid(1001)
feature_1001 = next(vector_layer.getFeatures(request))

In this case calling getFeatures(request) just returns the single feature with an ID of 1001 (which is why we shortcut and use next(…) here instead of iterating over the results).

Now, here’s the trap: calling getFeatures is expensive. If you call it on a vector layer, QGIS will be required to setup an new connection to the data store (the layer provider), create some query to return data, and parse each result as it is returned from the provider. This can be slow, especially if you’re working with some type of remote layer, such as a PostGIS table over a VPN connection. This brings us to our first trap:

Trap #1: Minimise the calls to getFeatures()

A common task in PyQGIS code is to take a list of feature IDs and then request those features from the layer. A see a lot of older code which does this using something like:

for id in some_list_of_feature_ids:
    request = QgsFeatureRequest().setFilterFid(id)
    feature = next(vector_layer.getFeatures(request))
    # do something with the feature

Why is this a bad idea? Well, remember that every time you call getFeatures() QGIS needs to do a whole bunch of things before it can start giving you the matching features. In this case, the code is calling getFeatures() once for every feature ID in the list. So if the list had 100 features, that means QGIS is having to create a connection to the data source, set up and prepare a query to match a single feature, wait for the provider to process that, and then finally parse the single feature result. That’s a lot of wasted processing!

If the code is rewritten to take the call to getFeatures() outside of the loop, then the result is:

request = QgsFeatureRequest().setFilterFids(some_list_of_feature_ids)
for feature in vector_layer.getFeatures(request):
    # do something with the feature

Now there’s just a single call to getFeatures() here. QGIS optimises this request by using a single connection to the data source, preparing the query just once, and fetching the results in appropriately sized batches. The difference is huge, especially if you’re dealing with a large number of features.

Trap #2: Use QgsFeatureRequest filters appropriately

Here’s another common mistake I see in PyQGIS code. I often see this one when an author is trying to do something with all the selected features in a layer:

for feature in vector_layer.getFeatures():
    if not feature.id() in vector_layer.selectedFeaturesIds():
        continue

    # do something with the feature

What’s happening here is that the code is iterating over all the features in the layer, and then skipping over any which aren’t in the list of selected features. See the problem here? This code iterates over EVERY feature in the layer. If you’re layer has 10 million features, we are fetching every one of these from the data source, going through all the work of parsing it into a QGIS feature, and then promptly discarding it if it’s not in our list of selected features. It’s very inefficient, especially if fetching features is slow (such as when connecting to a remote database source).

Instead, this code should use the setFilterFids() method for QgsFeatureRequest:

request = QgsFeatureRequest().setFilterFids(vector_layer.selectedFeaturesIds())
for feature in vector_layer.getFeatures(request):
    # do something with the feature

Now, QGIS will only fetch features from the provider with matching feature IDs from the list. Instead of fetching and processing every feature in the layer, only the actual selected features will be fetched. It’s not uncommon to see operations which previously took many minutes (or hours!) drop down to a few seconds after applying this fix.

Another variant of this trap uses expressions to test the returned features:

filter_expression = QgsExpression('my_field > 20')
for feature in vector_layer.getFeatures():
    if not filter_expression.evaluate(feature):
        continue

    # do something with the feature

Again, this code is fetching every single feature from the layer and then discarding it if it doesn’t match the “my_field > 20” filter expression. By rewriting this to:

request = QgsFeatureRequest().setFilterExpression('my_field > 20')
for feature in vector_layer.getFeatures(request):
    # do something with the feature

we hand over the bulk of the filtering to the data source itself. Recent QGIS versions intelligently translate the filter into a format which can be applied directly at the provider, meaning that any relevant indexes and other optimisations can be applied by the provider itself. In this case the rewritten code means that ONLY the features matching the ‘my_field > 20’ criteria are fetched from the provider – there’s no time wasted messing around with features we don’t need.

 

Trap #3: Only request values you need

The last trap I often see is that more values are requested from the layer then are actually required. Let’s take the code:

my_sum = 0
for feature in vector_layer.getFeatures(request):
    my_sum += feature['value']

In this case there’s no way we can optimise the filters applied, since we need to process every feature in the layer. But – this code is still inefficient. By default QGIS will fetch all the details for a feature from the provider. This includes all attribute values and the feature’s geometry. That’s a lot of processing – QGIS needs to transform the values from their original format into a format usable by QGIS, and the feature’s geometry needs to be parsed from it’s original type and rebuilt as a QgsGeometry object. In our sample code above we aren’t doing anything with the geometry, and we are only using a single attribute from the layer. By calling setFlags( QgsFeatureRequest.NoGeometry ) and setSubsetOfAttributes() we can tell QGIS that we don’t need the geometry, and we only require a single attribute’s value:

my_sum = 0
request = QgsFeatureRequest().setFlags(QgsFeatureRequest.NoGeometry).setSubsetOfAttributes(['value'], vector_layer.fields() )
for feature in vector_layer.getFeatures(request):
    my_sum += feature['value']

None of the unnecessary geometry parsing will occur, and only the ‘value’ attribute will be fetched and populated in the features. This cuts down both on the processing required AND the amount of data transfer between the layer’s provider and QGIS. It’s a significant improvement if you’re dealing with larger layers.

Conclusion

Optimising your feature requests is one of the easiest ways to speed up your PyQGIS script! It’s worth spending some time looking over all your uses of getFeatures() to see whether you can cut down on what you’re requesting – the results can often be mind blowing!

Tagged , , , ,

Two book recommendations

I recently finished reading two books which may be of interest to open-source GIS users – “PostGIS Cookbook” and “The PyQGIS Programmer’s Guide“, both of which I highly recommend:

PostGIS Cookbook

PostGIS CookbookI’ve been a fan of Stephen Mather’s blog for a while now, and have consistently found it to be a great source of trustworthy information and creative solutions to GIS problems. So when I first saw mention of his work on the PostGIS Cookbook I knew it would be a must-read for me. PostGIS is an essential part of my daily toolkit, and I’ll quickly devour any tutorial or guide which can lead me to better ways to put it to use. And that’s exactly what this book is! It’s full of tips and guides which has inspired me in a lot of techniques I’d never tried or even thought possible in PostGIS.

It’s important to point out that this book isn’t a training manual or beginner’s guide to PostGIS. It assumes readers are already familiar with using PostGIS and have a good understanding of GIS software in general. (If you’re looking for a book to start from scratch with PostGIS, PostGIS in Action is a better fit). I think that’s really what makes this book stand out though. There’s currently not a lot of books available covering PostGIS, and as far as I’m aware the PostGIS Cookbook is the only book available which is targeted to experienced PostGIS users.

Highlights for me are:

  • A great explanation and write up on optimised KNN filtering in PostGIS (something which often trips me up)
  • The detailed guide to topologically correct simplification of features
  • The exploration of PgRouting, which is a great introduction to PostGIS’ routing abilities
  • The “PostGIS and the web” chapter – I really wasn’t expecting this, but it’s quite eye opening (I’m going to have to do some digging into GeoDjango sometime)

The only criticism I have with this book is that it jumps around a lot between operating systems. While most of the code is provided for both Linux/OSX and Windows, there’s occasional examples which only have code for one specific operating system. It’s a little jarring and assumes the user is well versed in their particular operating system to workaround these omissions.

Overall, I strongly recommend the PostGIS Cookbook, and would consider it a must have for anyone serious about expanding their PostGIS abilities. (Also, looks like the publisher, Packt, have a two-for-one sale going at the moment, so it’s a good time to grab this title).

The PyQGIS Programmer’s Guide

The PyQGIS Programmer's GuideThe second book I’ve just finished reading is Gary Sherman’s “The PyQGIS Programmer’s Guide“. For those who are unaware, Gary was the original founder of QGIS back in 2002, so you can be confident that he knows exactly what he’s writing about. In The PyQGIS Programmer’s Guide  Gary has created an in-depth guide on how to get started with programming for QGIS using python. It takes readers all the way from simple scripts right through to developing QGIS plugins and standalone applications based on the QGIS API.

This book fills an important void in the literature available for QGIS. Previously, the PyQGIS Developer Cookbook was the only available guide for QGIS python scripting, and unfortunately it’s a little out-of-date now. PyQGIS scripting can be a steep learning curve and that’s why this book is so appreciated.

It would be valuable to have some python knowledge and experience prior to reading this book. While the “Python Basics” chapter quickly runs through an introduction to the language, the book makes no claims to be a comprehensive python tutorial. But if you’ve dabbled in the language before and have familiarity with the python way of doing things you’ll easily be able to follow along.

Highlights are:

  • The “Tips and Techniques” chapter, which is a great mini-reference for performing a range of common tasks in PyQGIS (including loading layers, changing symbol styles, editing feature attributes, etc).
  • A complete tutorial for creating a QGIS plugin
  • A guide to debugging PyQGIS code and plugins

I’d definitely recommend that anyone who wants to get started with PyQGIS start with Gary’s work – you’ll find it the perfect place to begin.

Tagged , , , , ,