Python performance tips and best practice for CLIMADA developers#

This guide covers the following recommendations:

⏲️ Use profiling tools to find and assess performance bottlenecks.
🔁 Replace for-loops by built-in functions and efficient external implementations.
📝 Consider algorithmic performance, not only implementation performance.
🧊 Get familiar with NumPy: vectorized functions, slicing, masks and broadcasting.
⚫ Miscellaneous: sparse arrays, Numba, parallelization, huge files (xarray), memory.
⚠️ Don’t over-optimize at the expense of readability and usability.

Profiling#

Python comes with powerful packages for the performance assessment of your code. Within IPython and notebooks, there are several magic commands for this task:

%time: Time the execution of a single statement
%timeit: Time repeated execution of a single statement for more accuracy
%%timeit Does the same as %timeit for a whole cell
%prun: Run code with the profiler
%lprun: Run code with the line-by-line profiler
%memit: Measure the memory use of a single statement
%mprun: Run code with the line-by-line memory profiler

More information on profiling in the Python Data Science Handbook.

Also useful: unofficial Jupyter extension Execute Time.

While it’s easy to assess how fast or slow parts of your code are, including finding the bottlenecks, generating an improved version of it is much harder. This guide is about simple best practices that everyone should know who works with Python, especially when models are performance-critical.

In the following, we will focus on arithmetic operations because they play an important role in CLIMADA. Operations on non-numeric objects like strings, graphs, databases, file or network IO might be just as relevant inside and outside of the CLIMADA context. Some of the tips presented here do also apply to other contexts, but it’s always worth looking for context-specific performance guides.

General considerations#

This section will be concerned with:

🔁 for-loops and built-ins
📦 external implementations and converting data structures
📝 algorithmic efficiency
💾 memory usage

𝚺 As this section’s toy example, let’s assume we want to sum up all the numbers in a list:

list_of_numbers = list(range(10000))

for-loops#

A developer with a background in C++ would probably loop over the entries of the list:

%%timeit
result = 0
for i in list_of_numbers:
    result += i

332 µs ± 65.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The built-in function sum is much faster:

%timeit sum(list_of_numbers)

54.9 µs ± 5.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The timing improves by a factor of 5-6 and this is not a coincidence: for-loops generally tend to get prohibitively expensive when the number of iterations increases.

💡 When you have a for-loop with many iterations in your code, check for built-in functions or efficient external implementations of your programming task.

A special case worth noting are append operations on lists which can often be replaced by more efficient list comprehensions.

Converting data structures#

💡 When you find an external library that solves your task efficiently, always consider that it might be necessary to convert your data structure which takes time.

For arithmetic operations, NumPy is a great library, but if your data comes as a Python list, NumPy will spend quite some time converting it to a NumPy array:

import numpy as np
%timeit np.sum(list_of_numbers)

572 µs ± 80 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This operation is even slower than the for-loop!

However, if you can somehow obtain your data in the form of NumPy arrays from the start, or if you perform many operations that might compensate for the conversion time, the gain in performance can be considerable:

# do the conversion outside of the `%timeit`
ndarray_of_numbers = np.array(list_of_numbers)
%timeit np.sum(ndarray_of_numbers)

10.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Indeed, this is 5-6 times faster than the built-in sum and 20-30 times faster than the for-loop.

Always consider several implementations#

Even for such a basic task as summing, there exist several implementations whose performance can vary more than you might expect:

%timeit ndarray_of_numbers.sum()
%timeit np.einsum("i->", ndarray_of_numbers)

9.07 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.55 µs ± 383 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This is up to 50 times faster than the for-loop. More information about the einsum function will be given in the NumPy section of this guide.

Efficient algorithms#

💡 Consider algorithmic performance, not only implementation performance.

All of the examples above do exactly the same thing, algorithmically. However, often the largest performance improvements can be obtained from algorithmic changes. This is the case when your model or your data contain symmetries or more complex structure that allows you to skip or boil down arithmetic operations.

In our example, we are summing the numbers from 1 to 10,000 and it’s a well known mathematical theorem that this can be done using only two multiplications and an increment:

n = max(list_of_numbers)
%timeit 0.5 * n * (n + 1)

83.1 ns ± 2.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Not surprisingly, This is almost 100 times faster than even the fastest implementation of the 10,000 summing operations listed above.

You don’t need a degree in maths to find algorithmic improvements. Other algorithmic improvements that are often easy to detect are:

Filter your data set as much as possible to perform operations only on those entries that are really relevant.
Example: When computing a physical hazard (e.g. extreme wind) with CLIMADA, restrict to Centroids on land unless you know that some of your exposure is off shore.
Make sure to detect inconsistent or trivial input parameters early on, before starting any operations.
Example: If your code does some complicated stuff and applies a user-provided normalization factor at the very end, make sure to check that the factor is not 0 before you start applying those complicated operations.

📝 In general: Before starting to code, take pen and paper and write down what you want to do from an algorithmic perspective.

Memory usage#

💡 Be careful with deep copies of large data sets and only load portions of large files into memory as needed.

Write your code in such a way that you handle large amounts of data chunk by chunk so that Python does not need to load everything into memory before performing any operations. When you do, Python’s generators might help you with the implementation.

🐌 Allocating unnecessary amounts of memory might slow down your code substantially due to swapping.

Miscellaneous#

Sparse matrices#

In many contexts, we deal with sparse matrices or sparse data structures, i.e. two-dimensional arrays where most of the entries are 0. In CLIMADA, this is especially the case for the intensity attributes of Hazard objects. This kind of data is usually handled using SciPy’s submodule scipy.sparse.

💡 When dealing with sparse matrices make sure that you always understand exactly which of your variables are sparse and which are dense and only switch from sparse to dense when absolutely necessary.

💡 Multiplications (multiply) and matrix multiplications (dot) are often faster than operations that involve masks or indexing.

As an example for the last rule, consider the problem of multiplying certain rows of a sparse array by a scalar:

import scipy.sparse as sparse

array = np.tile(np.array([0, 0, 0, 2, 0, 0, 0, 1, 0], dtype=np.float64), (100, 80))
row_mask = np.tile(np.array([False, False, True, False, True], dtype=bool), (20,))

In the following cells, note that the code in the first line after the %%timeit statement is not timed, it’s the setup line.

%%timeit sparse_array = sparse.csr_matrix(array)
sparse_array[row_mask, :] *= 5

/home/tovogt/.local/share/miniconda3/envs/tc/lib/python3.7/site-packages/scipy/sparse/data.py:55: RuntimeWarning: overflow encountered in multiply
  self.data *= other

1.52 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit sparse_array = sparse.csr_matrix(array)
sparse_array.multiply(np.where(row_mask, 5, 1)[:, None]).tocsr()

340 µs ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit sparse_array = sparse.csr_matrix(array)
sparse.diags(np.where(row_mask, 5, 1)).dot(sparse_array)

400 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Fast for-loops using Numba#

As a last resort, if there’s no way to avoid a for-loop even with NumPy’s vectorization capabilities, you can use the @njit decorator provided by the Numba package:

from numba import njit

@njit
def sum_array(arr):
    result = 0.0
    for i in range(arr.shape[0]):
        result += arr[i]
    return result

In fact, the Numba function is more than 100 times faster than without the decorator:

input_arr = np.float64(np.random.randint(low=0, high=10, size=(10000,)))

%timeit sum_array(input_arr)

10.9 µs ± 444 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Call the function without the @njit
%timeit sum_array.py_func(input_arr)

1.84 ms ± 65.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, whenever available, NumPy’s own vectorized functions will usually be faster than Numba.

%timeit np.sum(input_arr)
%timeit input_arr.sum()
%timeit np.einsum("i->", input_arr)

6 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
89 µs ± 499 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

💡 Make sure you understand the basic idea behind Numba before using it, read the Numba docs.

💡 Don’t use @jit, but use @njit which is an alias for @jit(nopython=True).

When you know what you are doing, the fastmath and parallel options can boost performance even further: read more about this in the Numba docs.

Parallelizing tasks#

Depending on your hardware setup, parallelizing tasks using pathos and Numba’s automatic parallelization feature can improve the performance of your implementation.

💡 Expensive hardware is no excuse for inefficient code.

Many tasks in CLIMADA could profit from GPU implementations. However, currently there are no plans to include GPU support in CLIMADA because of the considerable development and maintenance workload that would come with it. If you want to change this, contact the core team of developers, open an issue or mention it in the bi-weekly meetings.

Read NetCDF datasets with `xarray`#

When dealing with NetCDF datasets, memory is often an issue, because even if the file is only a few megabytes in size, the uncompressed raw arrays contained within can be several gigabytes large (especially when data is sparse or similarly structured). One way of dealing with this situation is to open the dataset with xarray.

💡 xarray allows to read the shape and type of variables contained in the dataset without loading any of the actual data into memory.

Furthermore, when loading slices and arithmetically aggregating variables, memory is allocated not more than necessary, but values are obtained on-the-fly from the file.

Take-home messages#

We conclude by repeating the gist of this guide:

Python performance tips and best practice for CLIMADA developers

Contents