Impact Function Calibration#

CLIMADA provides the climada.util.calibrate module for calibrating impact functions based on impact data. This tutorial guides you through the usage of this module by calibrating an impact function for tropical cyclones (TCs).

For further information on the classes available from the module, see its documentation.

Overview#

The basic idea of the calibration is to find a set of parameters for an impact function that minimizes the deviation between the calculated impact and some impact data. For setting up a calibration task, users have to supply the following information:

  • Hazard and Exposure (as usual, see the tutorial)

  • The impact data to calibrate the model to

  • An impact function definition depending on the calibrated parameters

  • Bounds and constraints of the calibrated parameters (depending on the calibration algorithm)

  • A “cost function” defining the single-valued deviation between impact data and calculated impact

  • A function for transforming the calculated impact into the same data structure as the impact data

This information defines the calibration task and is collected in an Input object. Afterwards, the user may pass this object to one of the optimizer classes provided by the module.

The following tutorial walks through the input data preparation and the setup of a BayesianOptimizer instance for calibration. For a brief example, refer to Quickstart. If you want to go through a somewhat realistic calibration task step-by-step, continue with Calibration Data.

import logging
import climada

logging.getLogger("climada").setLevel("WARNING")

Quickstart#

This section gives a very quick overview of assembling a calibration task. Here, we calibrate a single impact function for damage reports in Mexico (MEX) from a TC with IbtracsID 2010176N16278.

import pandas as pd
from sklearn.metrics import mean_squared_log_error

from climada.util import log_level
from climada.util.api_client import Client
from climada.entity import ImpactFuncSet, ImpfTropCyclone
from climada.util.calibrate import (
    Input,
    BayesianOptimizer,
    BayesianOptimizerController,
    OutputEvaluator,
)

# Load hazard and exposure from Data API
client = Client()
exposure = client.get_litpop("MEX")
exposure.gdf["impf_TC"] = 1
all_tcs = client.get_hazard(
    "tropical_cyclone",
    properties={"event_type": "observed", "spatial_coverage": "global"},
)
hazard = all_tcs.select(event_names=["2010176N16278"])

# Impact data (columns: region ID, index: hazard event ID)
data = pd.DataFrame(data=[[2.485465e09]], columns=[484], index=list(hazard.event_id))

# Create input
inp = Input(
    hazard=hazard,
    exposure=exposure,
    data=data,
    # Generate impact function from estimated parameters
    impact_func_creator=lambda v_half: ImpactFuncSet(
        [ImpfTropCyclone.from_emanuel_usa(v_half=v_half, impf_id=1)]
    ),
    # Estimated parameter bounds
    bounds={"v_half": (26, 100)},
    # Cost function
    cost_func=mean_squared_log_error,
    # Transform impact to pandas Dataframe with same structure as data
    impact_to_dataframe=lambda impact: impact.impact_at_reg(exposure.gdf["region_id"]),
)

# Set up optimizer (with controller)
controller = BayesianOptimizerController.from_input(inp)
opt = BayesianOptimizer(inp)

# Run optimization
with log_level("WARNING", "climada.engine.impact_calc"):
    output = opt.run(controller)

# Analyse results
output.plot_p_space()
out_eval = OutputEvaluator(inp, output)
out_eval.impf_set.plot()

# Optimal value
output.params
2024-07-08 16:36:14,126 - climada.entity.exposures.base - INFO - Reading /Users/ldr.riedel/climada/data/exposures/litpop/LitPop_150arcsec_MEX/v3/LitPop_150arcsec_MEX.hdf5
2024-07-08 16:36:20,330 - climada.hazard.base - INFO - Reading /Users/ldr.riedel/climada/data/hazard/tropical_cyclone/tropical_cyclone_0synth_tracks_150arcsec_global_1980_2020/v2/tropical_cyclone_0synth_tracks_150arcsec_global_1980_2020.hdf5
2024-07-08 16:36:28,741 - climada.entity.exposures.base - INFO - Matching 100369 exposures with 6125253 centroids.
2024-07-08 16:36:30,939 - climada.util.coordinates - INFO - No exact centroid match found. Reprojecting coordinates to nearest neighbor closer than the threshold = 100
2024-07-08 16:36:34,188 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-07-08 16:36:34,646 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-07-08 16:36:34,996 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-07-08 16:36:35,313 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 3
2024-07-08 16:36:35,690 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
2024-07-08 16:36:35,733 - climada.entity.exposures.base - INFO - Exposures matching centroids already found for TC
2024-07-08 16:36:35,736 - climada.entity.exposures.base - INFO - Existing centroids will be overwritten for TC
2024-07-08 16:36:35,736 - climada.entity.exposures.base - INFO - Matching 100369 exposures with 6125253 centroids.
2024-07-08 16:36:37,907 - climada.util.coordinates - INFO - No exact centroid match found. Reprojecting coordinates to nearest neighbor closer than the threshold = 100
2024-07-08 16:36:41,611 - climada.engine.impact_calc - INFO - Calculating impact for 292230 assets (>0) and 1 events.
{'v_half': 48.30330549244917}
[Figures: sampled parameter space and calibrated impact function]

Follow the next sections of the tutorial for a more in-depth explanation.

Calibration Data#

CLIMADA ships data from the international disaster database EM-DAT, which we will use to calibrate impact functions. In the first step, we select TC events that caused damages in the NA1 basin since 2010.

Later on, we determine the centroids at which to compute the wind fields by extracting the countries hit by the cyclones, retrieving a LitPop exposure instance for them, and using the exposure coordinates as centroid coordinates.

import pandas as pd
from climada.util.constants import SYSTEM_DIR

emdat = pd.read_csv(SYSTEM_DIR / "tc_impf_cal_v01_EDR.csv")
emdat_subset = emdat[(emdat.cal_region2 == "NA1") & (emdat.year >= 2010)]
emdat_subset
country region_id cal_region2 year EM_ID ibtracsID emdat_impact reference_year emdat_impact_scaled climada_impact ... scale log_ratio unique_ID Associated_disaster Surge Rain Flood Slide Other OtherThanSurge
326 MEX 484 NA1 2010 2010-0260 2010176N16278 2.000000e+09 2014 2.485465e+09 2.478270e+09 ... 1.0 -0.002899 2010-0260MEX True False False True False False True
331 ATG 28 NA1 2010 2010-0468 2010236N12341 1.260000e+07 2014 1.394594e+07 1.402875e+07 ... 1.0 0.005920 2010-0468ATG True False False True False False True
334 MEX 484 NA1 2010 2010-0494 2010257N16282 3.900000e+09 2014 4.846656e+09 4.857140e+09 ... 1.0 0.002161 2010-0494MEX True False False True False False True
339 LCA 662 NA1 2010 2010-0571 2010302N09306 5.000000e+05 2014 5.486675e+05 5.492871e+05 ... 1.0 0.001129 2010-0571LCA True False False True True False True
340 VCT 670 NA1 2010 2010-0571 2010302N09306 2.500000e+07 2014 2.670606e+07 2.676927e+07 ... 1.0 0.002364 2010-0571VCT False False False False False False False
344 BHS 44 NA1 2011 2011-0328 2011233N15301 4.000000e+07 2014 4.352258e+07 4.339898e+07 ... 1.0 -0.002844 2011-0328BHS False False False False False False False
345 DOM 214 NA1 2011 2011-0328 2011233N15301 3.000000e+07 2014 3.428317e+07 3.404744e+07 ... 1.0 -0.006900 2011-0328DOM True False False True False False True
346 PRI 630 NA1 2011 2011-0328 2011233N15301 5.000000e+08 2014 5.104338e+08 5.139659e+08 ... 1.0 0.006896 2011-0328PRI True False False True False False True
352 MEX 484 NA1 2011 2011-0385 2011279N10257 2.770000e+07 2014 3.084603e+07 3.077374e+07 ... 1.0 -0.002346 2011-0385MEX True False False True True False True
359 MEX 484 NA1 2012 2012-0276 2012215N12313 3.000000e+08 2014 3.283428e+08 3.284805e+08 ... 1.0 0.000419 2012-0276MEX False False False False False False False
365 MEX 484 NA1 2012 2012-0401 2012166N09269 5.550000e+08 2014 6.074341e+08 6.103254e+08 ... 1.0 0.004749 2012-0401MEX False False False False False False False
369 JAM 388 NA1 2012 2012-0410 2012296N14283 1.654200e+07 2014 1.548398e+07 1.548091e+07 ... 1.0 -0.000198 2012-0410JAM False False False False False False False
406 MEX 484 NA1 2014 2014-0333 2014253N13260 2.500000e+09 2014 2.500000e+09 2.501602e+09 ... 1.0 0.000640 2014-0333MEX True False False True False False True
427 MEX 484 NA1 2015 2015-0470 2015293N13266 8.230000e+08 2014 9.242430e+08 9.239007e+08 ... 1.0 -0.000370 2015-0470MEX True False False True False False True
428 CPV 132 NA1 2015 2015-0473 2015242N12343 1.100000e+06 2014 1.281242e+06 1.282842e+06 ... 1.0 0.001248 2015-0473CPV False False False False False False False
429 BHS 44 NA1 2015 2015-0479 2015270N27291 9.000000e+07 2014 8.362720e+07 1.129343e+06 ... 1.0 -4.304733 2015-0479BHS True True False True False False True
437 MEX 484 NA1 2016 2016-0319 2016248N15255 5.000000e+07 2014 6.098482e+07 6.088808e+07 ... 1.0 -0.001588 2016-0319MEX True False False True False False True
451 MEX 484 NA1 2017 2017-0334 2017219N16279 2.000000e+06 2014 2.284435e+06 2.285643e+06 ... 1.0 0.000529 2017-0334MEX True False False True True False True
456 ATG 28 NA1 2017 2017-0381 2017242N16333 2.500000e+08 2014 2.111764e+08 2.110753e+08 ... 1.0 -0.000479 2017-0381ATG False False False False False False False
457 BHS 44 NA1 2017 2017-0381 2017242N16333 2.000000e+06 2014 1.801876e+06 6.118379e+05 ... 1.0 -1.080116 2017-0381BHS False False False False False False False
458 CUB 192 NA1 2017 2017-0381 2017242N16333 1.320000e+10 2014 1.099275e+10 1.090424e+10 ... 1.0 -0.008085 2017-0381CUB True False False True False False True
459 KNA 659 NA1 2017 2017-0381 2017242N16333 2.000000e+07 2014 1.848489e+07 1.848665e+07 ... 1.0 0.000095 2017-0381KNA False False False False False False False
460 TCA 796 NA1 2017 2017-0381 2017242N16333 5.000000e+08 2014 5.000000e+08 5.003790e+08 ... 1.0 0.000758 2017-0381TCA True False False True False False True
462 VGB 92 NA1 2017 2017-0381 2017242N16333 3.000000e+09 2014 3.000000e+09 6.236200e+08 ... 1.0 -1.570826 2017-0381VGB False False False False False False False
463 DMA 212 NA1 2017 2017-0383 2017260N12310 1.456000e+09 2014 1.534596e+09 8.951186e+08 ... 1.0 -0.539066 2017-0383DMA True False False True True False True
464 DOM 214 NA1 2017 2017-0383 2017260N12310 6.300000e+07 2014 5.481371e+07 5.493466e+07 ... 1.0 0.002204 2017-0383DOM True False False True True False True
465 PRI 630 NA1 2017 2017-0383 2017260N12310 6.800000e+10 2014 6.700905e+10 6.702718e+10 ... 1.0 0.000271 2017-0383PRI True False False True False True True

27 rows × 22 columns

Each entry in the database refers to an economic impact for a specific country and TC event. The TC events are identified by the IDs assigned by the International Best Track Archive for Climate Stewardship (IBTrACS). We now want to reshape this data so that impacts are grouped by event and country.

To achieve this, we use pandas.pivot_table() with the track IDs as index and the region IDs as columns. Each row of the result then represents an event, and each column states the damage for a specific country. For missing event–country combinations, pandas will set the value to NaN. If we assumed that a missing entry means no damages were reported (a strong assumption), we could set all NaN values to zero by passing fill_value=0; here we leave them as missing.

track_ids = emdat_subset["ibtracsID"].unique()

data = pd.pivot_table(
    emdat_subset,
    values="emdat_impact_scaled",
    index="ibtracsID",
    columns="region_id",
    # fill_value=0,
)
data
region_id 28 44 92 132 192 212 214 388 484 630 659 662 670 796
ibtracsID
2010176N16278 NaN NaN NaN NaN NaN NaN NaN NaN 2.485465e+09 NaN NaN NaN NaN NaN
2010236N12341 1.394594e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010257N16282 NaN NaN NaN NaN NaN NaN NaN NaN 4.846656e+09 NaN NaN NaN NaN NaN
2010302N09306 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 548667.5019 26706058.15 NaN
2011233N15301 NaN 4.352258e+07 NaN NaN NaN NaN 34283168.75 NaN NaN 5.104338e+08 NaN NaN NaN NaN
2011279N10257 NaN NaN NaN NaN NaN NaN NaN NaN 3.084603e+07 NaN NaN NaN NaN NaN
2012166N09269 NaN NaN NaN NaN NaN NaN NaN NaN 6.074341e+08 NaN NaN NaN NaN NaN
2012215N12313 NaN NaN NaN NaN NaN NaN NaN NaN 3.283428e+08 NaN NaN NaN NaN NaN
2012296N14283 NaN NaN NaN NaN NaN NaN NaN 15483975.86 NaN NaN NaN NaN NaN NaN
2014253N13260 NaN NaN NaN NaN NaN NaN NaN NaN 2.500000e+09 NaN NaN NaN NaN NaN
2015242N12343 NaN NaN NaN 1281242.483 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015270N27291 NaN 8.362720e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015293N13266 NaN NaN NaN NaN NaN NaN NaN NaN 9.242430e+08 NaN NaN NaN NaN NaN
2016248N15255 NaN NaN NaN NaN NaN NaN NaN NaN 6.098482e+07 NaN NaN NaN NaN NaN
2017219N16279 NaN NaN NaN NaN NaN NaN NaN NaN 2.284435e+06 NaN NaN NaN NaN NaN
2017242N16333 2.111764e+08 1.801876e+06 3.000000e+09 NaN 1.099275e+10 NaN NaN NaN NaN NaN 18484889.46 NaN NaN 500000000.0
2017260N12310 NaN NaN NaN NaN NaN 1.534596e+09 54813712.03 NaN NaN 6.700905e+10 NaN NaN NaN NaN

This is the data against which we want to compare our model output. Let’s continue setting up the calibration!

Model Setup#

In the first step, we create the exposure layer for the model. We download a pre-computed LitPop exposure for each country listed in our calibration data from the CLIMADA Data API and concatenate them. The countries are the columns of the data object, given as numeric region IDs, which we convert to alpha-3 ISO codes for the API query.

Alternatively, we could compute the exposure ourselves via the LitPop.from_countries() classmethod, as shown in the commented-out code.

# from climada.entity.exposures.litpop import LitPop
# from climada.util import log_level

# # Calculate the exposure
# with log_level("ERROR"):
#     exposure = LitPop.from_countries(data.columns.tolist())
from climada.util.api_client import Client
from climada.entity.exposures.litpop import LitPop
from climada.util.coordinates import country_to_iso

client = Client()
exposure = LitPop.concat(
    [
        client.get_litpop(country_to_iso(country_id, representation="alpha3"))
        for country_id in data.columns
    ]
)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
  setstate(state)
from climada.util.api_client import Client

client = Client()
tc_dataset_infos = client.list_dataset_infos(data_type="tropical_cyclone")
client.get_property_values(
    tc_dataset_infos,
    known_property_values={"event_type": "observed", "spatial_coverage": "global"},
)
{'res_arcsec': ['150'],
 'event_type': ['observed'],
 'spatial_coverage': ['global'],
 'climate_scenario': ['None']}

We will use the CLIMADA Data API to download readily computed wind fields from TC tracks. The API provides a large dataset containing all historical TC tracks. We will download them and then select the subset of TCs for which we have impact data by using select().

from climada.util.api_client import Client

client = Client()
all_tcs = client.get_hazard(
    "tropical_cyclone",
    properties={"event_type": "observed", "spatial_coverage": "global"},
)
hazard = all_tcs.select(event_names=track_ids.tolist())

NOTE: Discouraged! This will usually take longer than using the Data API

Alternatively, CLIMADA provides the TCTracks class, which lets us download the tracks of TCs using their IBTrACS IDs. We then have to equalize the time steps of the different TC tracks.

A cyclone's track and intensity alone are not sufficient to compute impacts in CLIMADA. We first have to compute a wind field from each track at the locations of interest. For consistency, we simply choose the coordinates of the exposure.

# NOTE: Uncomment this to compute wind fields yourself

# from climada.hazard import Centroids, TCTracks, TropCyclone

# # Get the tracks for associated TCs
# tracks = TCTracks.from_ibtracs_netcdf(storm_id=track_ids.tolist())
# tracks.equal_timestep(time_step_h=1.0, land_params=False)
# tracks.plot()

# # Calculate windfield for the tracks
# centroids = Centroids.from_lat_lon(exposure.gdf.latitude, exposure.gdf.longitude)
# hazard = TropCyclone.from_tracks(tracks, centroids)

Calibration Setup#

We are now set up to define the specifics of the calibration task. First, let us define the impact function we actually want to calibrate. We select the formula by Emanuel (2011), for which a shortcut exists in the CLIMADA code base: from_emanuel_usa(). The sigmoid-like impact function takes three parameters, the wind threshold for any impact v_thresh, the wind speed where half of the impact occurs v_half, and the maximum impact factor scale. According to the model by Emanuel (2011), v_thresh is considered a constant, so we choose to only calibrate the latter two.
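For intuition, the Emanuel (2011) functional form can be sketched in plain Python. This is a hand-written illustration, not CLIMADA's implementation (use from_emanuel_usa() in practice), and the parameter values in the usage line are made up:

```python
import numpy as np


def emanuel_impact(v, v_thresh, v_half, scale):
    """Damage fraction at wind speed v (m/s), Emanuel (2011) form."""
    # Normalized wind speed above the impact threshold
    v_n = np.maximum(v - v_thresh, 0.0) / (v_half - v_thresh)
    # Sigmoid-like curve: 0 at v_thresh, scale/2 at v_half, approaching scale
    return scale * v_n**3 / (1.0 + v_n**3)


# At v_half, exactly half of the maximum impact `scale` is reached
print(emanuel_impact(50.0, v_thresh=25.0, v_half=50.0, scale=1.0))  # 0.5
```

This also makes the role of the parameters explicit: v_thresh shifts the onset of damage, v_half sets the half-impact wind speed, and scale caps the damage fraction.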

Any CLIMADA Optimizer will roughly perform the following algorithm:

  1. Create a set of parameters and build an impact function (set) from it.

  2. Compute an impact with that impact function set.

  3. Compare the impact and the calibration data via the cost/target function.

  4. Repeat N times or until the target function goal is reached.

How new parameters are selected based on the target function varies strongly between different optimization algorithms.

For the first step, we have to supply a function that takes the parameters we try to estimate and returns the impact function set that can later be used in an impact calculation. We only calibrate a single function for the entire basin, so this is straightforward.

To ensure the impact function is applied correctly, we also have to set the impf_ column of the exposure GeoDataFrame. Note that the default impact function ID is 1, and that the hazard type is "TC".

from climada.entity import ImpactFuncSet, ImpfTropCyclone

# Match impact function and exposure
exposure.gdf["impf_TC"] = 1


def impact_func_tc(v_half, scale):
    return ImpactFuncSet([ImpfTropCyclone.from_emanuel_usa(v_half=v_half, scale=scale)])

We will be using the BayesianOptimizer, which requires very little information on the parameter space beforehand. One crucial piece of information is the bounds of the parameters, though. Initial values are not needed because the optimizer first samples the bounded parameter space uniformly and then iteratively “narrows down” the search. We choose a v_half between v_thresh and 150, and a scale between 0.01 (it must never be zero) and 1.0. Specifying the bounds as a dictionary (required for the BayesianOptimizer) also serves the purpose of naming the parameters we want to calibrate. Notice that these names have to match the arguments of the impact function generator.

bounds = {"v_half": (25.8, 150), "scale": (0.01, 1)}

Defining the cost function is crucial for the result of the calibration. You can choose whatever is best suited for your application. Often, it is not clear which function works best, so it is a good idea to try out a few. Because the impacts of different events may vary over several orders of magnitude, we select the mean squared logarithmic error (MSLE). This and other error measures are readily supplied by the sklearn package.

The cost function must be defined as a function that takes the impact object calculated by the optimization algorithm and the input calibration data as arguments, and that returns a single number. This number represents a “cost” of the parameter set used for calculating the impact. A higher cost therefore is worse, a lower cost is better. Any optimizer will try to minimize the cost.
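As an illustration, a custom cost function could be sketched as follows. This is a hypothetical stand-in for sklearn's mean_squared_log_error that additionally skips missing damage reports; it assumes both arguments arrive as pandas DataFrames with matching layout (which is what impact_to_dataframe ensures), with the calibration data passed first and the modeled impact second, as with sklearn's metrics:

```python
import numpy as np
import pandas as pd


def msle_ignore_missing(data: pd.DataFrame, predicted: pd.DataFrame) -> float:
    """Mean squared logarithmic error, skipping NaN entries in the data.

    A higher return value means a worse fit; the optimizer minimizes it.
    """
    # Align both frames so that events (rows) and regions (columns) match
    data, predicted = data.align(predicted, join="outer")
    sq_log_err = (np.log1p(predicted) - np.log1p(data)) ** 2
    # NaN entries (missing damage reports) are ignored by nanmean
    return float(np.nanmean(sq_log_err.to_numpy()))
```

Such a function could be passed directly as the cost_func argument of Input in place of mean_squared_log_error.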

Note that the impact object is an instance of Impact, whereas the input calibration data is a pd.DataFrame. To compute the MSLE, we first have to transform the impact into the same data structure, meaning that we have to aggregate the point-wise impacts by event and country. The function performing this transformation task is provided to the Input via its impact_to_dataframe attribute. Here we choose climada.engine.impact.Impact.impact_at_reg(), which aggregates over countries by default. To improve performance, we can supply this function with our known region IDs instead of re-computing them in every step.

Computations on data frames align columns and indexes. The indexes of the calibration data are the IBTrACS IDs, but the indexes of the result of impact_at_reg are the hazard event IDs, which at this point are only integer numbers. To resolve that, we adjust our calibration dataframe to carry the respective Hazard.event_id as index.

data = data.rename(
    index={
        hazard.event_name[idx]: hazard.event_id[idx]
        for idx in range(len(hazard.event_id))
    }
)
data.index.rename("event_id", inplace=True)
data
region_id 28 44 92 132 192 212 214 388 484 630 659 662 670 796
event_id
1333 NaN NaN NaN NaN NaN NaN NaN NaN 2.485465e+09 NaN NaN NaN NaN NaN
1339 1.394594e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1344 NaN NaN NaN NaN NaN NaN NaN NaN 4.846656e+09 NaN NaN NaN NaN NaN
1351 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 548667.5019 26706058.15 NaN
1361 NaN 4.352258e+07 NaN NaN NaN NaN 34283168.75 NaN NaN 5.104338e+08 NaN NaN NaN NaN
3686 NaN NaN NaN NaN NaN NaN NaN NaN 3.084603e+07 NaN NaN NaN NaN NaN
3691 NaN NaN NaN NaN NaN NaN NaN NaN 6.074341e+08 NaN NaN NaN NaN NaN
1377 NaN NaN NaN NaN NaN NaN NaN NaN 3.283428e+08 NaN NaN NaN NaN NaN
1390 NaN NaN NaN NaN NaN NaN NaN 15483975.86 NaN NaN NaN NaN NaN NaN
3743 NaN NaN NaN NaN NaN NaN NaN NaN 2.500000e+09 NaN NaN NaN NaN NaN
1421 NaN NaN NaN 1281242.483 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1426 NaN 8.362720e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3777 NaN NaN NaN NaN NaN NaN NaN NaN 9.242430e+08 NaN NaN NaN NaN NaN
3795 NaN NaN NaN NaN NaN NaN NaN NaN 6.098482e+07 NaN NaN NaN NaN NaN
1450 NaN NaN NaN NaN NaN NaN NaN NaN 2.284435e+06 NaN NaN NaN NaN NaN
1454 2.111764e+08 1.801876e+06 3.000000e+09 NaN 1.099275e+10 NaN NaN NaN NaN NaN 18484889.46 NaN NaN 500000000.0
1458 NaN NaN NaN NaN NaN 1.534596e+09 54813712.03 NaN NaN 6.700905e+10 NaN NaN NaN NaN

Execute the Calibration#

We created the BayesianOptimizerController class to control and guide the calibration process. It walks through several optimization iterations and stops once the best guess can no longer be improved. The optimization works as follows:

  1. The optimizer randomly samples the parameter space init_points times.

  2. The optimizer uses a Gaussian regression process to “smartly” sample the parameter space at most n_iter times.

    • The process uses an “Upper Confidence Bound” sampling method whose parameter kappa controls how closely the sampled points stick to the current best guess. A higher kappa means more exploration of the parameter space, a lower kappa means more exploitation.

    • After each sample, the parameter kappa is reduced by the factor kappa_decay. By default, this parameter is set such that kappa equals kappa_min at the last step. This way, the sampling becomes more exploitative the more steps are taken.

  3. The controller tracks the improvements of the best guess for the parameters. If min_improvement_count consecutive improvements are each smaller than min_improvement, the smart sampling is stopped. In this case, the iteration count is increased and the process is repeated from step 1.

  4. If an entire iteration did not show any improvement, the optimization is stopped. It is also stopped when the max_iterations count is reached.

Users can control the “density”, and thus the accuracy, of the sampling by adjusting the controller parameters. Increasing init_points, n_iter, min_improvement_count, and max_iterations, and decreasing min_improvement generally increases density and accuracy, but leads to longer runtimes.

We suggest using the from_input classmethod for a convenient choice of sampling density based on the parameter space. The two parameters init_points and n_iter are set to \(b^N\), where \(N\) is the number of estimated parameters and \(b\) is the sampling_base parameter, which defaults to 4.
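These defaults can be illustrated numerically. The following sketch mirrors the formulas stated above; the starting kappa and kappa_min values are assumptions for illustration, not values read from CLIMADA:

```python
# Sampling density: init_points = n_iter = b ** N (documented default)
sampling_base = 4  # default of the sampling_base parameter
num_params = 2  # we calibrate v_half and scale
init_points = n_iter = sampling_base**num_params  # 16 samples each

# Kappa decay: chosen such that kappa reaches kappa_min at the last
# smart-sampling step (starting values are assumptions for illustration)
kappa, kappa_min = 2.576, 0.1
kappa_decay = (kappa_min / kappa) ** (1.0 / n_iter)
final_kappa = kappa * kappa_decay**n_iter
print(init_points, n_iter, round(final_kappa, 3))  # 16 16 0.1
```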

Now we can finally execute our calibration task! We will plug all input parameters in an instance of Input, and then create the optimizer instance with it. The Optimizer.run method returns an Output object, whose params attribute holds the optimal parameters determined by the calibration.

Notice that the underlying BayesianOptimization package maximizes a target function, which here is the negative of the cost function. Therefore, higher target values are better in this case.

from climada.util.calibrate import Input, BayesianOptimizer, BayesianOptimizerController
from sklearn.metrics import mean_squared_log_error

from climada.util import log_level

# Define calibration input
with log_level("INFO", name_prefix="climada.util.calibrate"):
    input = Input(
        hazard=hazard,
        exposure=exposure,
        data=data,
        impact_func_creator=impact_func_tc,
        cost_func=mean_squared_log_error,
        impact_to_dataframe=lambda imp: imp.impact_at_reg(exposure.gdf["region_id"]),
        bounds=bounds,
    )

    # Create and run the optimizer
    opt = BayesianOptimizer(input)
    controller = BayesianOptimizerController.from_input(input)
    bayes_output = opt.run(controller)
    bayes_output.params  # The optimal parameters
2024-04-29 14:09:45,569 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-04-29 14:09:46,990 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-04-29 14:09:49,255 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-04-29 14:09:50,873 - climada.util.calibrate.bayesian_optimizer - INFO - Minimal improvement. Stop iteration.
2024-04-29 14:09:50,874 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 3
2024-04-29 14:09:52,698 - climada.util.calibrate.bayesian_optimizer - INFO - Minimal improvement. Stop iteration.
2024-04-29 14:09:52,699 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 4
2024-04-29 14:09:55,673 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 5
2024-04-29 14:09:58,819 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 6
2024-04-29 14:10:02,072 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 7
2024-04-29 14:10:06,238 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.

Evaluate Output#

The Bayesian optimizer returns the entire parameter space it sampled via a BayesianOptimizerOutput. We can learn a lot about the relations between the fitted parameters by investigating how the cost function value depends on them. We can retrieve the parameter space as a pandas.DataFrame via p_space_to_dataframe(). This dataframe has MultiIndex columns: one group holds the Parameters, the other holds Calibration information for each parameter set. Notice that the optimal parameter set is not necessarily the last entry in the parameter space!

p_space_df = bayes_output.p_space_to_dataframe()
p_space_df
Parameters Calibration
scale v_half Cost Function
Iteration
0 0.422852 115.264302 2.726950
1 0.010113 63.349706 4.133135
2 0.155288 37.268453 0.800611
3 0.194398 68.718642 1.683610
4 0.402800 92.721038 2.046407
... ... ... ...
246 0.790590 50.164555 0.765166
247 0.788425 48.043636 0.768636
248 0.826704 49.932348 0.764991
249 0.880736 49.290532 0.767437
250 0.744523 51.596642 0.770761

251 rows × 3 columns
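Because the columns form a MultiIndex, the best sample can also be looked up directly, without sorting. The following sketch uses a small made-up frame with the same column layout as p_space_to_dataframe(); the values are illustrative only:

```python
import pandas as pd

# Toy parameter space mimicking the MultiIndex layout shown above
p_space = pd.DataFrame(
    {
        ("Parameters", "scale"): [0.42, 0.16, 0.99],
        ("Parameters", "v_half"): [115.3, 37.3, 52.3],
        ("Calibration", "Cost Function"): [2.73, 0.80, 0.76],
    }
)
p_space.index.name = "Iteration"

# Row label of the sample with the lowest cost
best_idx = p_space[("Calibration", "Cost Function")].idxmin()
# Only the parameter values of that sample, as a plain dict
best_params = p_space.loc[best_idx, "Parameters"].to_dict()
print(best_params)  # {'scale': 0.99, 'v_half': 52.3}
```

The same idxmin() lookup works on the real parameter space dataframe.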

In contrast, the controller only tracks the consecutive improvements of the best guess.

controller.improvements()
iteration random target improvement
sample
0 0 True -2.726950 inf
2 0 True -0.800611 2.406088
23 0 False -0.799484 0.001409
25 0 False -0.794119 0.006756
29 0 False -0.791723 0.003026
40 1 False -0.781115 0.013581
44 1 False -0.777910 0.004119
55 1 False -0.772557 0.006929
88 2 False -0.768626 0.005115
92 2 False -0.768494 0.000172
93 2 False -0.768144 0.000456
113 3 False -0.767889 0.000333
115 3 False -0.767105 0.001021
116 3 False -0.766561 0.000709
122 3 False -0.766400 0.000211
142 4 False -0.765210 0.001556
146 4 False -0.764947 0.000343
173 5 False -0.764000 0.001240
216 6 False -0.763936 0.000084

To select one group of columns, simply access it by name.

p_space_df["Parameters"]
scale v_half
Iteration
0 0.422852 115.264302
1 0.010113 63.349706
2 0.155288 37.268453
3 0.194398 68.718642
4 0.402800 92.721038
... ... ...
246 0.790590 50.164555
247 0.788425 48.043636
248 0.826704 49.932348
249 0.880736 49.290532
250 0.744523 51.596642

251 rows × 2 columns

Since the optimal parameter set is not necessarily the last entry in the parameter space, let's sort the parameter space by ascending cost function values.

p_space_df = p_space_df.sort_values(("Calibration", "Cost Function"), ascending=True)
p_space_df
Parameters Calibration
scale v_half Cost Function
Iteration
216 0.993744 52.255606 0.763936
173 0.985721 52.026516 0.764000
177 0.881143 51.558665 0.764718
146 0.967284 51.034702 0.764947
248 0.826704 49.932348 0.764991
... ... ... ...
35 0.028105 118.967924 6.298332
95 0.012842 102.449398 6.635728
199 0.031310 143.537900 7.262185
99 0.025663 141.236104 7.504095
232 0.010398 147.113486 9.453765

251 rows × 3 columns

The BayesianOptimizerOutput supplies the plot_p_space() method for convenience. Had we calibrated more than two parameters, it would produce a plot for each parameter combination.

bayes_output.plot_p_space(x="v_half", y="scale")
<Axes: xlabel='(Parameters, v_half)', ylabel='(Parameters, scale)'>
../_images/3e54e32eda571fd3251e46bec71770aaa2da5917a67efccbf1b020a615fe399d.png
p_space_df["Parameters"].iloc[0, :].to_dict()
{'scale': 0.9937435057717706, 'v_half': 52.25560598796059}
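If sorting is not needed, the row with the lowest cost can also be located directly with idxmin(). A minimal sketch on a toy frame with the same MultiIndex column layout (the values here are made up for illustration):

```python
import pandas as pd

# Toy parameter space mimicking the column layout of p_space_df
p_space = pd.DataFrame(
    {
        ("Parameters", "scale"): [0.42, 0.16, 0.99],
        ("Parameters", "v_half"): [115.3, 37.3, 52.3],
        ("Calibration", "Cost Function"): [2.73, 0.80, 0.76],
    }
)
p_space.index.name = "Iteration"

# Locate the row with the minimal cost function value directly
best = p_space.loc[p_space[("Calibration", "Cost Function")].idxmin()]
best["Parameters"].to_dict()  # {'scale': 0.99, 'v_half': 52.3}
```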

Analyze the Calibration#

Now that we have obtained a calibration result, we should investigate it further. Evaluating and plotting results is simplified by the OutputEvaluator, which takes the input and output of a calibration task as parameters. Let's start by plotting the optimized impact function:

from climada.util.calibrate import OutputEvaluator

output_eval = OutputEvaluator(input, bayes_output)
output_eval.impf_set.plot()
<Axes: title={'center': 'TC 1: Emanuel 2011'}, xlabel='Intensity (m/s)', ylabel='Impact (%)'>
../_images/76a265b4650ecb6106aaf6df44f9f7f700c42ddcbef8e4532c084fe16899ecbf.png

Here we show how the variability in parameter combinations with similar cost function values (as seen in the plot of the parameter space) translates into varying impact functions. In addition, the hazard value distribution is shown. Together, this provides an intuitive overview of the robustness of the optimization, given the chosen cost function. It does NOT provide a view of the sampling uncertainty (as, e.g., bootstrapping or cross-validation would), NOR of the suitability of the cost function, which is chosen by the user.

This functionality is only available from the BayesianOptimizerOutputEvaluator, which is tailored to Bayesian optimizer outputs. It includes all functions of the OutputEvaluator.

from climada.util.calibrate import BayesianOptimizerOutputEvaluator, select_best

output_eval = BayesianOptimizerOutputEvaluator(input, bayes_output)

# Plot the impact function variability
output_eval.plot_impf_variability(select_best(p_space_df, 0.03), plot_haz=False)
output_eval.plot_impf_variability(select_best(p_space_df, 0.03))
<Axes: xlabel='Intensity (m/s)', ylabel='Mean Damage Ratio (MDR)'>
../_images/a824ad100c2552f507377bfc438cba15c7c556172ea8eddc2779043a2cd2b614.png ../_images/5e1e9333ce5403d32b2d0a2aac6b238f3ab6a9056dd625e7b9108eb18b022ed1.png

The target function has limited meaning outside the calibration task. To investigate the quality of the calibration, it is helpful to compute the impact with the impact function defined by the optimal parameters. The OutputEvaluator already computed this impact when it was created; you can access it via the impact attribute.

import numpy as np

impact_data = output_eval.impact.impact_at_reg(exposure.gdf["region_id"])
impact_data.set_index(np.asarray(hazard.event_name), inplace=True)
impact_data
28 44 92 132 192 212 214 388 484 630 659 662 670 796
2010176N16278 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.641194e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2010236N12341 2.382896e+07 0.000000e+00 6.319901e+07 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 7.558366e+07 1.253981e+07 0.000000e+00 0.000000e+00 0.000000e+00
2010257N16282 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.292543e+07 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2010302N09306 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.727897e+06 4.578844e+06 6.995127e+05
2011233N15301 0.000000e+00 1.230474e+09 2.364701e+07 0.000000 0.000000e+00 0.000000e+00 7.109156e+04 0.000000e+00 0.000000e+00 7.094676e+08 0.000000e+00 0.000000e+00 0.000000e+00 1.383507e+08
2011279N10257 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.718903e+06 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2012215N12313 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.577538e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2012166N09269 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.901566e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2012296N14283 0.000000e+00 1.323275e+08 0.000000e+00 0.000000 2.751122e+09 0.000000e+00 0.000000e+00 3.101404e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2014253N13260 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.352591e+10 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2015293N13266 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.440247e+08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2015242N12343 0.000000e+00 0.000000e+00 0.000000e+00 227605.609803 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2015270N27291 0.000000e+00 8.433712e+05 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2016248N15255 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.127676e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2017219N16279 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.101261e+07 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2017242N16333 3.750206e+08 1.354638e+06 4.684316e+08 0.000000 3.369390e+09 0.000000e+00 2.693964e+07 0.000000e+00 0.000000e+00 1.172531e+10 4.040075e+08 0.000000e+00 0.000000e+00 8.060133e+08
2017260N12310 0.000000e+00 0.000000e+00 3.413228e+07 0.000000 0.000000e+00 6.646546e+08 1.189164e+08 0.000000e+00 0.000000e+00 8.918857e+10 0.000000e+00 0.000000e+00 0.000000e+00 9.235309e+07

We can now compare the modelled and reported impact data on a per-country or per-event basis. The OutputEvaluator also has methods for that. In both of them, you can supply a transformation function via the data_transf argument, which transforms the data right before plotting. Recall that we set the event IDs as the index of the data frames. To interpret the results more easily, it is useful to transform them back into event names, which are the IBTrACS IDs. Likewise, we used region IDs for region identification; it is nicer to transform these into country names before plotting.

import climada.util.coordinates as u_coord


def country_code_to_name(code):
    return u_coord.country_to_iso(code, representation="name")


event_id_to_name = {
    hazard.event_id[idx]: hazard.event_name[idx] for idx in range(len(hazard.event_id))
}
output_eval.plot_at_event(
    data_transf=lambda x: x.rename(index=event_id_to_name), logy=True
)
output_eval.plot_at_region(
    data_transf=lambda x: x.rename(index=country_code_to_name), logy=True
)
<Axes: ylabel='Impact [USD]'>
../_images/d5f28c2d79e363ad56a55d0cf83512217972eb040c51e4901c40155261c44118.png ../_images/7fac96fb9cf44c233e631eed59ca10427ca9f413c762335e40a22056f2d141db.png

Finally, we can do an event- and country-based comparison using a heatmap via plot_event_region_heatmap(). Since the magnitudes of the impact values may differ strongly, this method compares them on a logarithmic scale: it divides each modelled impact by the observed impact and takes the decadic logarithm. The result tells us how many orders of magnitude our model was off. Again, the considerations for "nicer" index and columns apply.
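To get a feel for the heatmap scale, a quick check with toy numbers (not taken from the tutorial data):

```python
import numpy as np

# An overestimation by one order of magnitude maps to +1,
# an exact match to 0, and a tenfold underestimation to -1
overestimate = np.log10(2.0e9 / 2.0e8)   # one order of magnitude too high
exact = np.log10(5.0e7 / 5.0e7)          # perfect match
underestimate = np.log10(3.0e6 / 3.0e7)  # one order of magnitude too low
```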

output_eval.plot_event_region_heatmap(
    data_transf=lambda x: x.rename(index=event_id_to_name, columns=country_code_to_name)
)
<Axes: >
../_images/d7622e3a79a570d1c56a3b72ef61ed844512d41f6840c54205531ba40fa794bf.png

Handling Missing Data#

NaN-valued input data has a special meaning in the calibration: it implies that the impact calculated by the model for that data point should be ignored. In contrast, a value of zero indicates that the model should be calibrated towards an impact of exactly zero for this data point.

There might be instances where data is provided for a certain region or event, and the model produces no impact for them. This is always treated as an impact of zero during the calibration. Likewise, there might be instances where the model computes an impact for a region for which no impact data is available. In these cases, missing_data_value of Input is used as fill value for the data. According to the data value logic mentioned above, missing_data_value=np.nan (the default setting) will cause the modeled impact to be ignored in the calibration. Setting missing_data_value=0, on the other hand, will calibrate the model towards zero impact for all regions or events where no data is supplied.
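A minimal sketch of this logic (this is not the internal CLIMADA implementation; the mean squared error here merely stands in for an arbitrary cost function):

```python
import numpy as np

# Toy data: two regions with damage reports, one region (NaN) without
data = np.array([1.0e8, 5.0e7, np.nan])
model = np.array([9.0e7, 6.0e7, 2.0e7])  # the model produces impact everywhere

def mse(a, b):
    return np.mean((a - b) ** 2)

# missing_data_value=np.nan (default): the NaN entry is masked out entirely
mask = ~np.isnan(data)
cost_ignore = mse(data[mask], model[mask])  # 1.0e14

# missing_data_value=0: the model is pushed towards zero impact there
cost_zero = mse(np.nan_to_num(data, nan=0.0), model)  # 2.0e14

# The zero-filled variant penalizes the 2e7 modelled impact in the region
# without data, dragging the calibration towards flatter impact functions
```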

Let’s exemplify this with a subset of the data used for the last calibration. Irma is the cyclone with ID 2017242N16333, and it hit most locations we were looking at before. It has the event ID 1454 in our setup. Let’s just use this hazard and the related data for now.

NOTE: We must pass a dataframe to the input, but selecting a single row or column will return a Series. We expand it into a dataframe again and make sure the new frame is oriented the correct way.

hazard_irma = all_tcs.select(event_names=["2017242N16333"])
data_irma = data.loc[1454, :].to_frame().T
data_irma
region_id 28 44 92 132 192 212 214 388 484 630 659 662 670 796
1454 211176356.8 1801876.321 3.000000e+09 NaN 1.099275e+10 NaN NaN NaN NaN NaN 18484889.46 NaN NaN 500000000.0

Let’s first calibrate the impact function only on this event, including all data we have.

from climada.util.calibrate import (
    Input,
    BayesianOptimizer,
    OutputEvaluator,
    BayesianOptimizerOutputEvaluator,
)
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt


def calibrate(hazard, data, **input_kwargs):
    """Calibrate using custom hazard and data"""
    # Define calibration input
    input = Input(
        hazard=hazard,
        exposure=exposure,
        data=data,
        impact_func_creator=impact_func_tc,
        cost_func=mean_squared_log_error,
        impact_to_dataframe=lambda imp: imp.impact_at_reg(exposure.gdf["region_id"]),
        bounds=bounds,
        **input_kwargs,
    )

    # Create and run the optimizer
    with log_level("INFO", name_prefix="climada.util.calibrate"):
        opt = BayesianOptimizer(input)
        controller = BayesianOptimizerController.from_input(input)
        bayes_output = opt.run(controller)

    # Evaluate output
    output_eval = OutputEvaluator(input, bayes_output)
    output_eval.impf_set.plot()

    plt.figure()  # New figure because seaborn.heatmap draws into active axes
    output_eval.plot_event_region_heatmap(
        data_transf=lambda x: x.rename(
            index=event_id_to_name, columns=country_code_to_name
        )
    )

    return bayes_output.params  # The optimal parameters

param_irma = calibrate(hazard_irma, data_irma)
2024-02-20 13:27:01,235 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-02-20 13:27:02,398 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-02-20 13:27:03,922 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-02-20 13:27:05,573 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
../_images/ecd9d2cb77361ead4f88ec71bab643a0cd82f01dbb052977d036ec1e91741f5e.png ../_images/f0126fa7c05d50bd5610988fed464a3c428c057818079e0bbd0e422a101ca75d.png

If we now remove some of the damage reports and repeat the calibration, the respective impacts computed by the model will be ignored. For Saint Kitts and Nevis and for the Turks and Caicos Islands, the model overestimates the impact. Removing these regions from the estimation should shift the estimated parameters accordingly, because by default (missing_data_value=np.nan), impacts at missing data points are ignored.

calibrate(hazard_irma, data_irma.drop(columns=[659, 796]))
2024-02-20 13:27:12,398 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-02-20 13:27:13,605 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-02-20 13:27:14,991 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-02-20 13:27:16,771 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 3
2024-02-20 13:27:18,916 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
{'scale': 0.9893201575415296, 'v_half': 46.03113797418196}
../_images/e57a1fe2d0a5043744edec49066518454dd50e883840f5e82a1b37d3af419813.png ../_images/a18f39f740323be0adb2cde36c92c42cda7553ffeac8ef0a8228e60de95228c6.png

However, the calibration should shift in the opposite direction once we require the modeled impact at missing data points to be zero:

calibrate(hazard_irma, data_irma.drop(columns=[659, 796]), missing_data_value=0)
2024-02-20 13:27:25,785 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-02-20 13:27:26,845 - climada.util.calibrate.bayesian_optimizer - INFO - Minimal improvement. Stop iteration.
2024-02-20 13:27:26,845 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-02-20 13:27:28,241 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
{'scale': 0.07120177027571553, 'v_half': 134.9655530108171}
../_images/9b493227e910e1dd41baf967a90a1cd20207f7106bc64dd3242b8c03c7024adc.png ../_images/c9a84a822e462e99cdabe99a60e95fa43739402a13b1ab862fbd5a1fd92d4f82.png

Actively requiring the model to calibrate towards zero impact in the two dropped regions means it will typically overestimate the impact there strongly (because impact actually did occur). This "flattens" the vulnerability curve, causing strong underestimation in the other regions.

How to Continue#

While the found impact function looks reasonable, we find that the model overestimates the impact severely for several events. This might be due to missing data, but it is also strongly related to the choice of impact function (shape) and the particular goal of the calibration task.

The most crucial ingredient of the calibration task is the cost function. The RMSE measure is sensitive to the largest errors (and hence to the largest impacts). Using it in the calibration therefore minimizes the overall error, but poorly captures events whose impacts are orders of magnitude smaller. A cost function based on the ratio between modelled and observed impact might increase the overall error but decrease the log-error for many events.

So we present some ideas on how to continue and/or improve the calibration:

  1. Run the calibration again, but change the number of initial steps and/or iteration steps.

  2. Use a different cost function, e.g., an error measure based on a ratio rather than a difference.

  3. Also calibrate the v_thresh parameter. This requires adding constraints, because v_thresh < v_half.

  4. Calibrate different impact functions for houses in Mexico and Puerto Rico within the same optimization task.

  5. Employ the ScipyMinimizeOptimizer instead of the BayesianOptimizer.
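To illustrate ideas 2 and 3, here is a hedged sketch of a ratio-based cost function and a simple parameter constraint. The function names and the eps handling are our own choices for illustration, not part of the CLIMADA API, and how a constraint is actually passed to the optimizer depends on the Input definition:

```python
import numpy as np

def mean_squared_log_ratio(impact_data, impact_model, eps=1.0):
    """Mean squared decadic log-ratio of modelled vs. observed impact.

    Weights an over-/underestimation by one order of magnitude equally,
    regardless of the absolute impact size. eps avoids a log of zero.
    """
    ratio = (np.asarray(impact_model) + eps) / (np.asarray(impact_data) + eps)
    return np.mean(np.log10(ratio) ** 2)

def constraint_ok(v_thresh, v_half, scale):
    """Feasibility predicate when also calibrating v_thresh: v_thresh < v_half."""
    return v_thresh < v_half
```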