{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Python performance tips and best practice for CLIMADA developers\n",
    "\n",
    "This guide covers the following recommendations:\n",
    "\n",
    "⏲️ **Use profiling tools** to find and assess performance bottlenecks.<br />\n",
    "🔁 **Replace for-loops** by built-in functions and efficient external implementations.<br />\n",
    "📝 **Consider algorithmic performance**, not only implementation performance.<br />\n",
    "🧊 **Get familiar with NumPy:** vectorized functions, slicing, masks and broadcasting.<br />\n",
    "⚫ **Miscellaneous:** sparse arrays, Numba, parallelization, huge files (xarray), memory.<br />\n",
    "⚠️ **Don't over-optimize** at the expense of readability and usability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    },
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#1-Profiling\" data-toc-modified-id=\"1-Profiling-1\">1 Profiling</a></span></li><li><span><a href=\"#2-General-considerations\" data-toc-modified-id=\"2-General-considerations-2\">2 General considerations</a></span><ul class=\"toc-item\"><li><span><a href=\"#2.1-for-loops\" data-toc-modified-id=\"2.1-for-loops-2.1\">2.1 for-loops</a></span></li><li><span><a href=\"#2.2-Converting-data-structures\" data-toc-modified-id=\"2.2-Converting-data-structures-2.2\">2.2 Converting data structures</a></span></li><li><span><a href=\"#2.3-Always-consider-several-implementations\" data-toc-modified-id=\"2.3-Always-consider-several-implementations-2.3\">2.3 Always consider several implementations</a></span></li><li><span><a href=\"#2.4-Efficient-algorithms\" data-toc-modified-id=\"2.4-Efficient-algorithms-2.4\">2.4 Efficient algorithms</a></span></li><li><span><a href=\"#2.5-Memory-usage\" data-toc-modified-id=\"2.5-Memory-usage-2.5\">2.5 Memory usage</a></span></li></ul></li><li><span><a href=\"#3-NumPy-related-tips-and-best-practice\" data-toc-modified-id=\"3-NumPy-related-tips-and-best-practice-3\">3 NumPy-related tips and best practice</a></span><ul class=\"toc-item\"><li><span><a href=\"#3.1-Vectorized-functions\" data-toc-modified-id=\"3.1-Vectorized-functions-3.1\">3.1 Vectorized functions</a></span></li><li><span><a href=\"#3.2-Broadcasting\" data-toc-modified-id=\"3.2-Broadcasting-3.2\">3.2 Broadcasting</a></span></li><li><span><a href=\"#3.3-A-note-on-in-place-operations\" data-toc-modified-id=\"3.3-A-note-on-in-place-operations-3.3\">3.3 A note on in-place operations</a></span></li></ul></li><li><span><a href=\"#4-Miscellaneous\" data-toc-modified-id=\"4-Miscellaneous-4\">4 Miscellaneous</a></span><ul class=\"toc-item\"><li><span><a href=\"#4.1-Sparse-matrices\" data-toc-modified-id=\"4.1-Sparse-matrices-4.1\">4.1 Sparse matrices</a></span></li><li><span><a href=\"#4.2-Fast-for-loops-using-Numba\" data-toc-modified-id=\"4.2-Fast-for-loops-using-Numba-4.2\">4.2 Fast for-loops using Numba</a></span></li><li><span><a href=\"#4.3-Parallelizing-tasks\" data-toc-modified-id=\"4.3-Parallelizing-tasks-4.3\">4.3 Parallelizing tasks</a></span></li><li><span><a href=\"#4.4-Read-NetCDF-datasets-with-xarray\" data-toc-modified-id=\"4.4-Read-NetCDF-datasets-with-xarray-4.4\">4.4 Read NetCDF datasets with <code>xarray</code></a></span></li></ul></li><li><span><a href=\"#5-Take-home-messages\" data-toc-modified-id=\"5-Take-home-messages-5\">5 Take-home messages</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 1 Profiling\n",
    "Python comes with powerful packages for the **performance assessment** of your code. Within IPython and notebooks, there are several magic commands for this task:\n",
    "\n",
    "* `%time`: Time the execution of a single statement\n",
    "* `%timeit`: Time repeated execution of a single statement for more accuracy\n",
    "* `%%timeit` Does the same as `%timeit` for a whole cell\n",
    "* `%prun`: Run code with the profiler\n",
    "* `%lprun`: Run code with the line-by-line profiler\n",
    "* `%memit`: Measure the memory use of a single statement\n",
    "* `%mprun`: Run code with the line-by-line memory profiler\n",
    "\n",
    "More information on profiling in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html).\n",
    "\n",
    "Also useful: unofficial Jupyter extension [Execute Time](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html).\n",
    "\n",
    "While it's easy to assess how fast or slow parts of your code are, including finding the bottlenecks, **generating an improved version of it is much harder**. This guide is about **simple best practices** that everyone should know who works with Python, especially when models are performance-critical.\n",
    "\n",
    "In the following, we will **focus on arithmetic operations** because they play an important role in CLIMADA. Operations on non-numeric objects like strings, graphs, databases, file or network IO might be just as relevant inside and outside of the CLIMADA context. Some of the tips presented here do also apply to other contexts, but **it's always worth looking for context-specific performance guides**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 2 General considerations\n",
    "This section will be concerned with:\n",
    "\n",
    "🔁 **for-loops** and built-ins<br />\n",
    "📦 **external implementations** and converting data structures<br />\n",
    "📝 **algorithmic efficiency**<br />\n",
    "💾 **memory usage**\n",
    "\n",
    "𝚺 As this section's toy example, let's assume we want to sum up all the numbers in a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "list_of_numbers = list(range(10000))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 2.1 for-loops\n",
    "\n",
    "A developer with a background in C++ would probably loop over the entries of the list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "332 µs ± 65.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "result = 0\n",
    "for i in list_of_numbers:\n",
    "    result += i"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The built-in function `sum` is much faster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "54.9 µs ± 5.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit sum(list_of_numbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "The timing improves by a factor of 5-6 and this is not a coincidence: **for-loops generally tend to get prohibitively expensive** when the number of iterations increases.\n",
    "\n",
    "**💡 When you have a for-loop with many iterations in your code, check for built-in functions or efficient external implementations of your programming task.**\n",
    "\n",
    "A special case worth noting are `append` operations on lists which can often be replaced by more efficient *list comprehensions*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 2.2 Converting data structures\n",
    "\n",
    "💡 **When you find an external library that solves your task efficiently, always consider that it might be necessary to convert your data structure which takes time.**\n",
    "\n",
    "For arithmetic operations, NumPy is a great library, but if your data comes as a Python list, NumPy will spend quite some time converting it to a NumPy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "572 µs ± 80 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "%timeit np.sum(list_of_numbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This operation is even slower than the for-loop!\n",
    "\n",
    "However, if you can somehow obtain your data in the form of **NumPy arrays from the start**, or if you perform many operations that might compensate for the conversion time, the gain in performance can be considerable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "# do the conversion outside of the `%timeit`\n",
    "ndarray_of_numbers = np.array(list_of_numbers)\n",
    "%timeit np.sum(ndarray_of_numbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, this is 5-6 times faster than the built-in `sum` and 20-30 times faster than the for-loop."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 2.3 Always consider several implementations\n",
    "\n",
    "Even for such a basic task as summing, there exist several implementations whose performance can vary more than you might expect:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.07 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n",
      "5.55 µs ± 383 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit ndarray_of_numbers.sum()\n",
    "%timeit np.einsum(\"i->\", ndarray_of_numbers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is up to 50 times faster than the for-loop. More information about the `einsum` function will be given in the NumPy section of this guide."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 2.4 Efficient algorithms\n",
    "\n",
    "💡 **Consider algorithmic performance, not only implementation performance.**\n",
    "\n",
    "All of the examples above do exactly the same thing, algorithmically. However, often the largest performance improvements can be obtained from **algorithmical changes**. This is the case when your model or your data contain symmetries or more complex structure that allows you to skip or boil down arithmetic operations.\n",
    "\n",
    "In our example, we are summing the numbers from 1 to 10,000 and it's a well known mathematical theorem that this can be done using only two multiplications and an increment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "83.1 ns ± 2.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)\n"
     ]
    }
   ],
   "source": [
    "n = max(list_of_numbers)\n",
    "%timeit 0.5 * n * (n + 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not surprisingly, This is almost 100 times faster than even the fastest implementation of the 10,000 summing operations listed above.\n",
    "\n",
    "You don't need a degree in maths to find algorithmical improvements. Other algorithmical improvements that are often easy to detect are:\n",
    "* **Filter your data set as much as possible** to perform operations only on those entries that are really relevant. <br> ***Example:*** When computing a physical hazard (e.g. extreme wind) with CLIMADA, restrict to Centroids on land unless you know that some of your exposure is off shore.\n",
    "* Make sure to **detect inconsistent or trivial input parameters early on**, before starting any operations. <br> ***Example:*** If your code does some complicated stuff and applies a user-provided normalization factor at the very end, make sure to check that the factor is not 0 before you start applying those complicated operations.\n",
    "\n",
    "📝 **In general: Before starting to code, take pen and paper and write down what you want to do from an algorithmic perspective.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 2.5 Memory usage\n",
    "💡 **Be careful with deep copies of large data sets and only load portions of large files into memory as needed.**\n",
    "\n",
    "Write your code in such a way that you **handle large amounts of data chunk by chunk** so that Python does not need to load everything into memory before performing any operations. When you do, Python's [generators](https://wiki.python.org/moin/Generators) might help you with the implementation.\n",
    "\n",
    "🐌 Allocating unnecessary amounts of memory might slow down your code substantially due to swapping."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 3 NumPy-related tips and best practice\n",
    "\n",
    "As mentioned above, arithmetic operations in Python can profit a lot from NumPy's capabilities. In this section, we collect some tips how to make use of NumPy's capabilities when performance is an issue."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 3.1 Vectorized functions\n",
    "We mentioned above that Python's **for-loops are really slow**. This is even more important when looping over the entries in a NumPy array. Fortunately, NumPy's masks, slicing notation and vectorization capabilities help to avoid for-loops in almost every possible situation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# TASK: compute the column-sum of a 2-dimensional array\n",
    "input_arr = np.random.rand(100, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "145 µs ± 5.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "# SLOW: summing over columns using loops\n",
    "output = np.zeros(100)\n",
    "for row_i in range(input_arr.shape[0]):\n",
    "    for col_i in range(input_arr.shape[1]):\n",
    "        output[row_i] += input_arr[row_i, col_i]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.23 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "# FASTER: using NumPy's vectorized `sum` function with `axis` attribute\n",
    "%timeit output = input_arr.sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "In the special case of multiplications and sums (linear operations) over the axes of two multi-dimensional arrays, NumPy's **`einsum`** is even faster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.38 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit output = np.einsum(\"ij->i\", input_arr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Another `einsum` example: **Euclidean norms**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "24.4 µs ± 2.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n",
      "26.5 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n",
      "9.5 µs ± 91.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "many_vectors = np.random.rand(1000, 3)\n",
    "%timeit np.sqrt((many_vectors**2).sum(axis=1))\n",
    "%timeit np.linalg.norm(many_vectors, axis=1)\n",
    "%timeit np.sqrt(np.einsum(\"...j,...j->...\", many_vectors, many_vectors))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more information about the capabilities of NumPy's `einsum` function, refer to [the official NumPy documentation](https://numpy.org/doc/stable/reference/generated/numpy.einsum.html). However, note that future releases of NumPy will eventually improve the performance of core functions, so that `einsum` will become an example of over-optimization (see above) at some point. Whenever you use `einsum`, consider adding a comment that explains what it does for users that are not familiar with `einsum`'s syntax.\n",
    "\n",
    "Not only `sum`, but many NumPy functions come with similar vectorization capabilities. You can take minima, maxima, means or standard deviations along selected axes. But did you know that the same is true for the `diff` and `argmin` functions?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[4, 2, 6],\n",
       "       [2, 3, 4],\n",
       "       [3, 3, 3],\n",
       "       [3, 2, 4]])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "arr = np.random.randint(low=0, high=10, size=(4, 3))\n",
    "arr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 0, 0, 1])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "arr.argmin(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 3.2 Broadcasting\n",
    "\n",
    "When operations are performed on several arrays, possibly of differing shapes, be sure to use NumPy's **broadcasting** capabilities. This will save you a lot of memory and time when performing arithmetic operations.\n",
    "\n",
    "***Example:*** We want to multiply the columns of a two-dimensional array by values stored in a one-dimensional array. There are two naive approaches to this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_arr = np.random.rand(100, 3)\n",
    "col_factors = np.random.rand(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.67 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "# SLOW: stack/tile the one-dimensional array to be two-dimensional\n",
    "%timeit output = np.tile(col_factors, (input_arr.shape[0], 1)) * input_arr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.63 µs ± 95.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "# SLOW: loop over columns and factors\n",
    "output = input_arr.copy()\n",
    "for i, factor in enumerate(col_factors):\n",
    "    output[:, i] *= factor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The idea of *broadcasting* is that NumPy **automatically matches axes from right to left and implicitly repeats data along missing axes** if necessary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.41 µs ± 51.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit output = col_factors * input_arr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "For automatic broadcasting, the *trailing* dimensions of two arrays have to match. NumPy is matching the shapes of the arrays *from right to left*. If you happen to have arrays where other dimensions match, **you have to tell NumPy which dimensions to add by adding an axis of length 1 for each missing dimension**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_arr = np.random.rand(3, 100)\n",
    "row_factors = np.random.rand(3)\n",
    "output = row_factors.reshape(3, 1) * input_arr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Because this concept is so important, there is a short-hand notation for adding an axis of length 1. In the slicing notation, **add `None` in those positions where broadcasting should take place**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_arr = np.random.rand(3, 100)\n",
    "row_factors = np.random.rand(3)\n",
    "output = row_factors[:, None] * input_arr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "input_arr = np.random.rand(7, 3, 5, 4, 6)\n",
    "factors = np.random.rand(7, 3, 4)\n",
    "output = factors[:, :, None, :, None] * input_arr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 3.3 A note on in-place operations\n",
    "While **in-place operations** are generally faster than long and explicit expressions, they shouldn't be over-estimated when looking for performance bottlenecks. Often, the loss in code readability is not justified because NumPy's memory management is really fast.\n",
    "\n",
    "💡 **Don't over-optimize!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "shape = (1200, 1700)\n",
    "arr_a = np.random.rand(*shape)\n",
    "arr_b = np.random.rand(*shape)\n",
    "arr_c = np.random.rand(*shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "17.3 ms ± 820 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "# long expression in one line\n",
    "%timeit arr_d = arr_c * (arr_a + arr_b) - arr_a + arr_c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "17.4 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "# almost same performance: in-place operations\n",
    "arr_d = arr_a + arr_b\n",
    "arr_d *= arr_c\n",
    "arr_d -= arr_a\n",
    "arr_d += arr_c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "%load_ext memory_profiler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "peak memory: 156.68 MiB, increment: 31.20 MiB\n"
     ]
    }
   ],
   "source": [
    "# long expression in one line\n",
    "%memit arr_d = arr_c * (arr_a + arr_b) - arr_a + arr_c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "peak memory: 157.27 MiB, increment: 0.00 MiB\n"
     ]
    }
   ],
   "source": [
    "%%memit\n",
    "# almost same memory usage: in-place operations\n",
    "arr_d = arr_a + arr_b\n",
    "arr_d *= arr_c\n",
    "arr_d -= arr_a\n",
    "arr_d += arr_c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4 Miscellaneous"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 4.1 Sparse matrices\n",
    "\n",
    "In many contexts, we deal with sparse matrices or sparse data structures, i.e. two-dimensional arrays where most of the entries are 0. In CLIMADA, this is especially the case for the `intensity` attributes of `Hazard` objects. This kind of data is usually handled using SciPy's submodule [scipy.sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html).\n",
    "\n",
    "💡 **When dealing with sparse matrices make sure that you always understand exactly which of your variables are sparse and which are dense and only switch from sparse to dense when absolutely necessary.**\n",
    "\n",
    "💡 **Multiplications (`multiply`) and matrix multiplications (`dot`) are often faster than operations that involve masks or indexing.**\n",
    "\n",
    "As an example for the last rule, consider the problem of multiplying certain rows of a sparse array by a scalar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scipy.sparse as sparse\n",
    "\n",
    "array = np.tile(np.array([0, 0, 0, 2, 0, 0, 0, 1, 0], dtype=np.float64), (100, 80))\n",
    "row_mask = np.tile(np.array([False, False, True, False, True], dtype=bool), (20,))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following cells, note that the code in the first line after the `%%timeit` statement is not timed, it's the setup line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/tovogt/.local/share/miniconda3/envs/tc/lib/python3.7/site-packages/scipy/sparse/data.py:55: RuntimeWarning: overflow encountered in multiply\n",
      "  self.data *= other\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.52 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit sparse_array = sparse.csr_matrix(array)\n",
    "sparse_array[row_mask, :] *= 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "340 µs ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit sparse_array = sparse.csr_matrix(array)\n",
    "sparse_array.multiply(np.where(row_mask, 5, 1)[:, None]).tocsr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "400 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit sparse_array = sparse.csr_matrix(array)\n",
    "sparse.diags(np.where(row_mask, 5, 1)).dot(sparse_array)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 4.2 Fast for-loops using Numba\n",
    "As a last resort, if there's no way to avoid a for-loop even with NumPy's vectorization capabilities, you can use the `@njit` decorator provided by the Numba package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numba import njit\n",
    "\n",
    "@njit\n",
    "def sum_array(arr):\n",
    "    result = 0.0\n",
    "    for i in range(arr.shape[0]):\n",
    "        result += arr[i]\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "In fact, the Numba function is more than 100 times faster than without the decorator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_arr = np.float64(np.random.randint(low=0, high=10, size=(10000,)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10.9 µs ± 444 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit sum_array(input_arr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.84 ms ± 65.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
     ]
    }
   ],
   "source": [
    "# Call the function without the @njit\n",
    "%timeit sum_array.py_func(input_arr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "However, whenever available, **NumPy's own vectorized functions will usually be faster than Numba**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.6 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n",
      "5.27 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n",
      "7.89 µs ± 499 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit np.sum(input_arr)\n",
    "%timeit input_arr.sum()\n",
    "%timeit np.einsum(\"i->\", input_arr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "💡 **Make sure you understand the basic idea behind Numba before using it, read the [Numba docs](https://numba.readthedocs.io/en/stable/user/5minguide.html).**\n",
    "\n",
    "💡 **Don't use `@jit`, but use `@njit` which is an alias for `@jit(nopython=True)`.**\n",
    "\n",
    "When you know what you are doing, the `fastmath` and `parallel` options can boost performance even further: read more about this in the [Numba docs](https://numba.pydata.org/numba-doc/dev/user/performance-tips.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 4.3 Parallelizing tasks\n",
    "\n",
    "Depending on your hardware setup, parallelizing tasks using [pathos](https://pathos.readthedocs.io/en/latest/pathos.html) and [Numba's automatic parallelization feature](https://numba.readthedocs.io/en/stable/user/parallel.html) can improve the performance of your implementation.\n",
    "\n",
    "💡 **Expensive hardware is no excuse for inefficient code.**\n",
    "\n",
    "Many tasks in CLIMADA could profit from GPU implementations. However, currently there are **no plans to include GPU support in CLIMADA** because of the considerable development and maintenance workload that would come with it. If you want to change this, contact the core team of developers, open an issue or mention it in the bi-weekly meetings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 4.4 Read NetCDF datasets with `xarray`\n",
    "When dealing with NetCDF datasets, memory is often an issue, because even if the file is only a few megabytes in size, the uncompressed raw arrays contained within can be several gigabytes large (especially when data is sparse or similarly structured). One way of dealing with this situation is to open the dataset with `xarray`.\n",
    "\n",
    "💡 **`xarray` allows to read the shape and type of variables contained in the dataset without loading any of the actual data into memory.**\n",
    "\n",
    "Furthermore, when loading slices and arithmetically aggregating variables, memory is allocated not more than necessary, but values are obtained on-the-fly from the file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 5 Take-home messages\n",
    "\n",
    "We conclude by repeating the gist of this guide:\n",
    "\n",
    "⏲️ **Use profiling tools** to find and assess performance bottlenecks.<br />\n",
    "🔁 **Replace for-loops** by built-in functions and efficient external implementations.<br />\n",
    "📝 **Consider algorithmic performance**, not only implementation performance.<br />\n",
    "🧊 **Get familiar with NumPy:** vectorized functions, slicing, masks and broadcasting.<br />\n",
    "⚫ **Miscellaneous:** sparse arrays, Numba, parallelization, huge files (xarray), memory.<br />\n",
    "⚠️ **Don't over-optimize** at the expense of readability and usability."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "384px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}