Impact Function Calibration#
CLIMADA provides the climada.util.calibrate
module for calibrating impact functions based on impact data.
This tutorial will guide through the usage of this module by calibrating an impact function for tropical cyclones (TCs).
For further information on the classes available from the module, see its documentation.
Overview#
The basic idea of the calibration is to find a set of parameters for an impact function that minimizes the deviation between the calculated impact and some impact data. For setting up a calibration task, users have to supply the following information:
Hazard and Exposure (as usual, see the tutorial)
The impact data to calibrate the model to
An impact function definition depending on the calibrated parameters
Bounds and constraints of the calibrated parameters (depending on the calibration algorithm)
A “cost function” defining the single-valued deviation between impact data and calculated impact
A function for transforming the calculated impact into the same data structure as the impact data
This information defines the calibration task and is inserted into the Input
object.
Afterwards, the user may insert this object into one of the optimizer classes.
Currently, the following classes are available:
BayesianOptimizer
: Uses Bayesian optimization to sample the parameter space.ScipyMinimizeOptimizer
: Uses thescipy.optimize.minimize
function for determining the best parameter set.
The following tutorial walks through the input data preparation and the setup of a BayesianOptimizer
instance for calibration.
For a brief example, refer to Quickstart.
If you want to go through a somewhat realistic calibration task step-by-step, continue with Calibration Data.
import logging
import climada
logging.getLogger("climada").setLevel("WARNING")
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/site-packages/dask/dataframe/_pyarrow_compat.py:17: FutureWarning: Minimal version of pyarrow will soon be increased to 14.0.1. You are using 12.0.1. Please consider upgrading.
warnings.warn(
Quickstart#
This section gives a very quick overview of assembling a calibration task.
Here, we calibrate a single impact function for damage reports in Mexico (MEX
) from a TC with IbtracsID 2010176N16278
.
import pandas as pd
from sklearn.metrics import mean_squared_log_error
from climada.util import log_level
from climada.util.api_client import Client
from climada.entity import ImpactFuncSet, ImpfTropCyclone
from climada.util.calibrate import (
Input,
BayesianOptimizer,
BayesianOptimizerController,
OutputEvaluator,
)
# Load hazard and exposure from Data API
client = Client()
exposure = client.get_litpop("MEX")
exposure.gdf["impf_TC"] = 1
all_tcs = client.get_hazard(
"tropical_cyclone",
properties={"event_type": "observed", "spatial_coverage": "global"},
)
hazard = all_tcs.select(event_names=["2010176N16278"])
# Impact data (columns: region ID, index: hazard event ID)
data = pd.DataFrame(data=[[2.485465e09]], columns=[484], index=list(hazard.event_id))
# Create input
inp = Input(
hazard=hazard,
exposure=exposure,
data=data,
# Generate impact function from estimated parameters
impact_func_creator=lambda v_half: ImpactFuncSet(
[ImpfTropCyclone.from_emanuel_usa(v_half=v_half, impf_id=1)]
),
# Estimated parameter bounds
bounds={"v_half": (26, 100)},
# Cost function
cost_func=mean_squared_log_error,
# Transform impact to pandas Dataframe with same structure as data
impact_to_dataframe=lambda impact: impact.impact_at_reg(exposure.gdf["region_id"]),
)
# Set up optimizer (with controller)
controller = BayesianOptimizerController.from_input(inp)
opt = BayesianOptimizer(inp)
# Run optimization
with log_level("WARNING", "climada.engine.impact_calc"):
output = opt.run(controller)
# Analyse results
output.plot_p_space()
out_eval = OutputEvaluator(inp, output)
out_eval.impf_set.plot()
# Optimal value
output.params
2024-07-08 16:36:14,126 - climada.entity.exposures.base - INFO - Reading /Users/ldr.riedel/climada/data/exposures/litpop/LitPop_150arcsec_MEX/v3/LitPop_150arcsec_MEX.hdf5
2024-07-08 16:36:20,330 - climada.hazard.base - INFO - Reading /Users/ldr.riedel/climada/data/hazard/tropical_cyclone/tropical_cyclone_0synth_tracks_150arcsec_global_1980_2020/v2/tropical_cyclone_0synth_tracks_150arcsec_global_1980_2020.hdf5
2024-07-08 16:36:28,741 - climada.entity.exposures.base - INFO - Matching 100369 exposures with 6125253 centroids.
2024-07-08 16:36:30,939 - climada.util.coordinates - INFO - No exact centroid match found. Reprojecting coordinates to nearest neighbor closer than the threshold = 100
2024-07-08 16:36:34,188 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-07-08 16:36:34,646 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-07-08 16:36:34,996 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-07-08 16:36:35,313 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 3
2024-07-08 16:36:35,690 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
2024-07-08 16:36:35,733 - climada.entity.exposures.base - INFO - Exposures matching centroids already found for TC
2024-07-08 16:36:35,736 - climada.entity.exposures.base - INFO - Existing centroids will be overwritten for TC
2024-07-08 16:36:35,736 - climada.entity.exposures.base - INFO - Matching 100369 exposures with 6125253 centroids.
2024-07-08 16:36:37,907 - climada.util.coordinates - INFO - No exact centroid match found. Reprojecting coordinates to nearest neighbor closer than the threshold = 100
2024-07-08 16:36:41,611 - climada.engine.impact_calc - INFO - Calculating impact for 292230 assets (>0) and 1 events.
{'v_half': 48.30330549244917}
Follow the next sections of the tutorial for a more in-depth explanation.
Calibration Data#
CLIMADA ships data from the International Disaster Database EM-DAT, which we will use to calibrate impact functions on.
In the first step, we will select TC events that caused damages in the NA1
basin since 2010.
We use EMDAT data from TCs occurring in the NA1
basin since 2010.
We calculate the centroids for which we want to compute the windfields by extracting the countries hit by the cyclones and retrieving a LitPop
exposure instance from them.
We then use the exposure coordinates as centroids coordinates.
import pandas as pd
from climada.util.constants import SYSTEM_DIR
emdat = pd.read_csv(SYSTEM_DIR / "tc_impf_cal_v01_EDR.csv")
emdat_subset = emdat[(emdat.cal_region2 == "NA1") & (emdat.year >= 2010)]
emdat_subset
country | region_id | cal_region2 | year | EM_ID | ibtracsID | emdat_impact | reference_year | emdat_impact_scaled | climada_impact | ... | scale | log_ratio | unique_ID | Associated_disaster | Surge | Rain | Flood | Slide | Other | OtherThanSurge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
326 | MEX | 484 | NA1 | 2010 | 2010-0260 | 2010176N16278 | 2.000000e+09 | 2014 | 2.485465e+09 | 2.478270e+09 | ... | 1.0 | -0.002899 | 2010-0260MEX | True | False | False | True | False | False | True |
331 | ATG | 28 | NA1 | 2010 | 2010-0468 | 2010236N12341 | 1.260000e+07 | 2014 | 1.394594e+07 | 1.402875e+07 | ... | 1.0 | 0.005920 | 2010-0468ATG | True | False | False | True | False | False | True |
334 | MEX | 484 | NA1 | 2010 | 2010-0494 | 2010257N16282 | 3.900000e+09 | 2014 | 4.846656e+09 | 4.857140e+09 | ... | 1.0 | 0.002161 | 2010-0494MEX | True | False | False | True | False | False | True |
339 | LCA | 662 | NA1 | 2010 | 2010-0571 | 2010302N09306 | 5.000000e+05 | 2014 | 5.486675e+05 | 5.492871e+05 | ... | 1.0 | 0.001129 | 2010-0571LCA | True | False | False | True | True | False | True |
340 | VCT | 670 | NA1 | 2010 | 2010-0571 | 2010302N09306 | 2.500000e+07 | 2014 | 2.670606e+07 | 2.676927e+07 | ... | 1.0 | 0.002364 | 2010-0571VCT | False | False | False | False | False | False | False |
344 | BHS | 44 | NA1 | 2011 | 2011-0328 | 2011233N15301 | 4.000000e+07 | 2014 | 4.352258e+07 | 4.339898e+07 | ... | 1.0 | -0.002844 | 2011-0328BHS | False | False | False | False | False | False | False |
345 | DOM | 214 | NA1 | 2011 | 2011-0328 | 2011233N15301 | 3.000000e+07 | 2014 | 3.428317e+07 | 3.404744e+07 | ... | 1.0 | -0.006900 | 2011-0328DOM | True | False | False | True | False | False | True |
346 | PRI | 630 | NA1 | 2011 | 2011-0328 | 2011233N15301 | 5.000000e+08 | 2014 | 5.104338e+08 | 5.139659e+08 | ... | 1.0 | 0.006896 | 2011-0328PRI | True | False | False | True | False | False | True |
352 | MEX | 484 | NA1 | 2011 | 2011-0385 | 2011279N10257 | 2.770000e+07 | 2014 | 3.084603e+07 | 3.077374e+07 | ... | 1.0 | -0.002346 | 2011-0385MEX | True | False | False | True | True | False | True |
359 | MEX | 484 | NA1 | 2012 | 2012-0276 | 2012215N12313 | 3.000000e+08 | 2014 | 3.283428e+08 | 3.284805e+08 | ... | 1.0 | 0.000419 | 2012-0276MEX | False | False | False | False | False | False | False |
365 | MEX | 484 | NA1 | 2012 | 2012-0401 | 2012166N09269 | 5.550000e+08 | 2014 | 6.074341e+08 | 6.103254e+08 | ... | 1.0 | 0.004749 | 2012-0401MEX | False | False | False | False | False | False | False |
369 | JAM | 388 | NA1 | 2012 | 2012-0410 | 2012296N14283 | 1.654200e+07 | 2014 | 1.548398e+07 | 1.548091e+07 | ... | 1.0 | -0.000198 | 2012-0410JAM | False | False | False | False | False | False | False |
406 | MEX | 484 | NA1 | 2014 | 2014-0333 | 2014253N13260 | 2.500000e+09 | 2014 | 2.500000e+09 | 2.501602e+09 | ... | 1.0 | 0.000640 | 2014-0333MEX | True | False | False | True | False | False | True |
427 | MEX | 484 | NA1 | 2015 | 2015-0470 | 2015293N13266 | 8.230000e+08 | 2014 | 9.242430e+08 | 9.239007e+08 | ... | 1.0 | -0.000370 | 2015-0470MEX | True | False | False | True | False | False | True |
428 | CPV | 132 | NA1 | 2015 | 2015-0473 | 2015242N12343 | 1.100000e+06 | 2014 | 1.281242e+06 | 1.282842e+06 | ... | 1.0 | 0.001248 | 2015-0473CPV | False | False | False | False | False | False | False |
429 | BHS | 44 | NA1 | 2015 | 2015-0479 | 2015270N27291 | 9.000000e+07 | 2014 | 8.362720e+07 | 1.129343e+06 | ... | 1.0 | -4.304733 | 2015-0479BHS | True | True | False | True | False | False | True |
437 | MEX | 484 | NA1 | 2016 | 2016-0319 | 2016248N15255 | 5.000000e+07 | 2014 | 6.098482e+07 | 6.088808e+07 | ... | 1.0 | -0.001588 | 2016-0319MEX | True | False | False | True | False | False | True |
451 | MEX | 484 | NA1 | 2017 | 2017-0334 | 2017219N16279 | 2.000000e+06 | 2014 | 2.284435e+06 | 2.285643e+06 | ... | 1.0 | 0.000529 | 2017-0334MEX | True | False | False | True | True | False | True |
456 | ATG | 28 | NA1 | 2017 | 2017-0381 | 2017242N16333 | 2.500000e+08 | 2014 | 2.111764e+08 | 2.110753e+08 | ... | 1.0 | -0.000479 | 2017-0381ATG | False | False | False | False | False | False | False |
457 | BHS | 44 | NA1 | 2017 | 2017-0381 | 2017242N16333 | 2.000000e+06 | 2014 | 1.801876e+06 | 6.118379e+05 | ... | 1.0 | -1.080116 | 2017-0381BHS | False | False | False | False | False | False | False |
458 | CUB | 192 | NA1 | 2017 | 2017-0381 | 2017242N16333 | 1.320000e+10 | 2014 | 1.099275e+10 | 1.090424e+10 | ... | 1.0 | -0.008085 | 2017-0381CUB | True | False | False | True | False | False | True |
459 | KNA | 659 | NA1 | 2017 | 2017-0381 | 2017242N16333 | 2.000000e+07 | 2014 | 1.848489e+07 | 1.848665e+07 | ... | 1.0 | 0.000095 | 2017-0381KNA | False | False | False | False | False | False | False |
460 | TCA | 796 | NA1 | 2017 | 2017-0381 | 2017242N16333 | 5.000000e+08 | 2014 | 5.000000e+08 | 5.003790e+08 | ... | 1.0 | 0.000758 | 2017-0381TCA | True | False | False | True | False | False | True |
462 | VGB | 92 | NA1 | 2017 | 2017-0381 | 2017242N16333 | 3.000000e+09 | 2014 | 3.000000e+09 | 6.236200e+08 | ... | 1.0 | -1.570826 | 2017-0381VGB | False | False | False | False | False | False | False |
463 | DMA | 212 | NA1 | 2017 | 2017-0383 | 2017260N12310 | 1.456000e+09 | 2014 | 1.534596e+09 | 8.951186e+08 | ... | 1.0 | -0.539066 | 2017-0383DMA | True | False | False | True | True | False | True |
464 | DOM | 214 | NA1 | 2017 | 2017-0383 | 2017260N12310 | 6.300000e+07 | 2014 | 5.481371e+07 | 5.493466e+07 | ... | 1.0 | 0.002204 | 2017-0383DOM | True | False | False | True | True | False | True |
465 | PRI | 630 | NA1 | 2017 | 2017-0383 | 2017260N12310 | 6.800000e+10 | 2014 | 6.700905e+10 | 6.702718e+10 | ... | 1.0 | 0.000271 | 2017-0383PRI | True | False | False | True | False | True | True |
27 rows × 22 columns
Each entry in the database refers to an economic impact for a specific country and TC event. The TC events are identified by the ID assigned from the International Best Track Archive for Climate Stewardship (IBTrACS). We now want to reshape this data so that impacts are grouped by event and country.
To achieve this, we iterate over the unique track IDs, select all reported damages associated with this ID, and concatenate the results.
For missing entries, pandas
will set the value to NaN.
We assume that missing entries means that no damages are reported (this is a strong assumption), and set all NaN values to zero.
Then, we transpose the dataframe so that each row represents an event and each column states the damage for a specific country.
Finally, we set the track ID to be the index of the data frame.
track_ids = emdat_subset["ibtracsID"].unique()
data = pd.pivot_table(
emdat_subset,
values="emdat_impact_scaled",
index="ibtracsID",
columns="region_id",
# fill_value=0,
)
data
region_id | 28 | 44 | 92 | 132 | 192 | 212 | 214 | 388 | 484 | 630 | 659 | 662 | 670 | 796 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ibtracsID | ||||||||||||||
2010176N16278 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.485465e+09 | NaN | NaN | NaN | NaN | NaN |
2010236N12341 | 1.394594e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2010257N16282 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.846656e+09 | NaN | NaN | NaN | NaN | NaN |
2010302N09306 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 548667.5019 | 26706058.15 | NaN |
2011233N15301 | NaN | 4.352258e+07 | NaN | NaN | NaN | NaN | 34283168.75 | NaN | NaN | 5.104338e+08 | NaN | NaN | NaN | NaN |
2011279N10257 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.084603e+07 | NaN | NaN | NaN | NaN | NaN |
2012166N09269 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.074341e+08 | NaN | NaN | NaN | NaN | NaN |
2012215N12313 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.283428e+08 | NaN | NaN | NaN | NaN | NaN |
2012296N14283 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15483975.86 | NaN | NaN | NaN | NaN | NaN | NaN |
2014253N13260 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.500000e+09 | NaN | NaN | NaN | NaN | NaN |
2015242N12343 | NaN | NaN | NaN | 1281242.483 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2015270N27291 | NaN | 8.362720e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2015293N13266 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.242430e+08 | NaN | NaN | NaN | NaN | NaN |
2016248N15255 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.098482e+07 | NaN | NaN | NaN | NaN | NaN |
2017219N16279 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.284435e+06 | NaN | NaN | NaN | NaN | NaN |
2017242N16333 | 2.111764e+08 | 1.801876e+06 | 3.000000e+09 | NaN | 1.099275e+10 | NaN | NaN | NaN | NaN | NaN | 18484889.46 | NaN | NaN | 500000000.0 |
2017260N12310 | NaN | NaN | NaN | NaN | NaN | 1.534596e+09 | 54813712.03 | NaN | NaN | 6.700905e+10 | NaN | NaN | NaN | NaN |
This is the data against which we want to compare our model output. Let’s continue setting up the calibration!
Model Setup#
In the first step, we create the exposure layer for the model.
We use the LitPop
module and simply pass the names of all countries listed in our calibration data to the from_countries()
classmethod.
The countries are the columns in the data
object.
Alternatively, we could have inserted emdat_subset["region_id"].unique().tolist()
.
# from climada.entity.exposures.litpop import LitPop
# from climada.util import log_level
# # Calculate the exposure
# with log_level("ERROR"):
# exposure = LitPop.from_countries(data.columns.tolist())
from climada.util.api_client import Client
from climada.entity.exposures.litpop import LitPop
from climada.util.coordinates import country_to_iso
client = Client()
exposure = LitPop.concat(
[
client.get_litpop(country_to_iso(country_id, representation="alpha3"))
for country_id in data.columns
]
)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
/Users/ldr.riedel/miniforge3/envs/climada_env_3.9/lib/python3.9/pickle.py:1717: UserWarning: Unpickling a shapely <2.0 geometry object. Please save the pickle again; shapely 2.1 will not have this compatibility.
setstate(state)
from climada.util.api_client import Client
client = Client()
tc_dataset_infos = client.list_dataset_infos(data_type="tropical_cyclone")
client.get_property_values(
client.list_dataset_infos(data_type="tropical_cyclone"),
known_property_values={"event_type": "observed", "spatial_coverage": "global"},
)
{'res_arcsec': ['150'],
'event_type': ['observed'],
'spatial_coverage': ['global'],
'climate_scenario': ['None']}
We will use the CLIMADA Data API to download readily computed wind fields from TC tracks.
The API provides a large dataset containing all historical TC tracks.
We will download them and then select the subset of TCs for which we have impact data by using select()
.
from climada.util.api_client import Client
client = Client()
all_tcs = client.get_hazard(
"tropical_cyclone",
properties={"event_type": "observed", "spatial_coverage": "global"},
)
hazard = all_tcs.select(event_names=track_ids.tolist())
NOTE: Discouraged! This will usually take a longer time than using the Data API
Alternatively, CLIMADA provides the TCTracks
class, which lets us download the tracks of TCs using their IBTrACS IDs.
We then have to equalize the time steps of the different TC tracks.
The track and intensity of a cyclone are insufficient to compute impacts in CLIMADA. We first have to re-compute a windfield from each track at the locations of interest. For consistency, we simply choose the coordinates of the exposure.
# NOTE: Uncomment this to compute wind fields yourself
# from climada.hazard import Centroids, TCTracks, TropCyclone
# # Get the tracks for associated TCs
# tracks = TCTracks.from_ibtracs_netcdf(storm_id=track_ids.tolist())
# tracks.equal_timestep(time_step_h=1.0, land_params=False)
# tracks.plot()
# # Calculate windfield for the tracks
# centroids = Centroids.from_lat_lon(exposure.gdf.latitude, exposure.gdf.longitude)
# hazard = TropCyclone.from_tracks(tracks, centroids)
Calibration Setup#
We are now set up to define the specifics of the calibration task.
First, let us define the impact function we actually want to calibrate.
We select the formula by Emanuel (2011), for which a shortcut exists in the CLIMADA code base: from_emanuel_usa()
.
The sigmoid-like impact function takes three parameters, the wind threshold for any impact v_thresh
, the wind speed where half of the impact occurs v_half
, and the maximum impact factor scale
.
According to the model by Emanuel (2011), v_thresh
is considered a constant, so we choose to only calibrate the latter two.
Any CLIMADA Optimizer
will roughly perform the following algorithm:
Create a set of parameters and built an impact function (set) from it.
Compute an impact with that impact function set.
Compare the impact and the calibration data via the cost/target function.
Repeat N times or until the target function goal is reached.
The selection of parameters is based on the target function varies strongly between different optimization algorithms.
For the first step, we have to supply a function that takes the parameters we try to estimate and returns the impact function set that can later be used in an impact calculation. We only calibrate a single function for the entire basin, so this is straightforward.
To ensure the impact function is applied correctly, we also have to set the impf_
column of the exposure GeoDataFrame
.
Note that the default impact function ID is 1, and that the hazard type is "TC"
.
from climada.entity import ImpactFuncSet, ImpfTropCyclone
# Match impact function and exposure
exposure.gdf["impf_TC"] = 1
def impact_func_tc(v_half, scale):
return ImpactFuncSet([ImpfTropCyclone.from_emanuel_usa(v_half=v_half, scale=scale)])
We will be using the BayesianOptimizer
, which requires very little information on the parameter space beforehand.
One crucial information are the bounds of the parameters, though.
Initial values are not needed because the optimizer first samples the bound parameter space uniformly and then iteratively “narrows down” the search.
We choose a v_half
between v_thresh
and 150, and a scale between 0.01 (it must never be zero) and 1.0.
Specifying the bounds as dictionary (a must in case of BayesianOptimizer
) also serves the purpose of naming the parameters we want to calibrate.
Notice that these names have to match the arguments of the impact function generator.
bounds = {"v_half": (25.8, 150), "scale": (0.01, 1)}
Defining the cost function is crucial for the result of the calibration.
You can choose what is best suited for your application.
Often, it is not clear which function works best, and it’s a good idea to try out a few.
Because the impacts of different events may vary over several orders of magnitude, we select the mean squared logartihmic error (MSLE).
This one and other error measures are readily supplied by the sklearn
package.
The cost function must be defined as a function that takes the impact object calculated by the optimization algorithm and the input calibration data as arguments, and that returns a single number. This number represents a “cost” of the parameter set used for calculating the impact. A higher cost therefore is worse, a lower cost is better. Any optimizer will try to minimize the cost.
Note that the impact object is an instance of Impact
, whereas the input calibration data is a pd.DataFrame
.
To compute the MSLE, we first have to transform the impact into the same data structure, meaning that we have to aggregate the point-wise impacts by event and country.
The function performing this transformation task is provided to the Input
via its impact_to_dataframe
attribute.
Here we choose climada.engine.impact.Impact.impact_at_reg()
, which aggregates over countries by default.
To improve performance, we can supply this function with our known region IDs instead of re-computing them in every step.
Computations on data frames align columns and indexes.
The indexes of the calibration data are the IBTrACS IDs, but the indexes of the result of impact_at_reg
are the hazard event IDs, which at this point are only integer numbers.
To resolve that, we adjust our calibration dataframe to carry the respective Hazard.event_id
as index.
data = data.rename(
index={
hazard.event_name[idx]: hazard.event_id[idx]
for idx in range(len(hazard.event_id))
}
)
data.index.rename("event_id", inplace=True)
data
region_id | 28 | 44 | 92 | 132 | 192 | 212 | 214 | 388 | 484 | 630 | 659 | 662 | 670 | 796 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
event_id | ||||||||||||||
1333 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.485465e+09 | NaN | NaN | NaN | NaN | NaN |
1339 | 1.394594e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1344 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.846656e+09 | NaN | NaN | NaN | NaN | NaN |
1351 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 548667.5019 | 26706058.15 | NaN |
1361 | NaN | 4.352258e+07 | NaN | NaN | NaN | NaN | 34283168.75 | NaN | NaN | 5.104338e+08 | NaN | NaN | NaN | NaN |
3686 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.084603e+07 | NaN | NaN | NaN | NaN | NaN |
3691 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.074341e+08 | NaN | NaN | NaN | NaN | NaN |
1377 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.283428e+08 | NaN | NaN | NaN | NaN | NaN |
1390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15483975.86 | NaN | NaN | NaN | NaN | NaN | NaN |
3743 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.500000e+09 | NaN | NaN | NaN | NaN | NaN |
1421 | NaN | NaN | NaN | 1281242.483 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1426 | NaN | 8.362720e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3777 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.242430e+08 | NaN | NaN | NaN | NaN | NaN |
3795 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.098482e+07 | NaN | NaN | NaN | NaN | NaN |
1450 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.284435e+06 | NaN | NaN | NaN | NaN | NaN |
1454 | 2.111764e+08 | 1.801876e+06 | 3.000000e+09 | NaN | 1.099275e+10 | NaN | NaN | NaN | NaN | NaN | 18484889.46 | NaN | NaN | 500000000.0 |
1458 | NaN | NaN | NaN | NaN | NaN | 1.534596e+09 | 54813712.03 | NaN | NaN | 6.700905e+10 | NaN | NaN | NaN | NaN |
Execute the Calibration#
We created a class BayesianOptimizerController
to control and guide the calibration process.
It is intended to walk through several optimization iterations and stop the process if the best guess cannot be improved.
The optimization works as follows:
The optimizer randomly samples the parameter space
init_points
times.The optimizer uses a Gaussian regression process to “smartly” sample the parameter space at most
n_iter
times.The process uses an “Upper Confidence Bound” sampling method whose parameter
kappa
indicates how close the sampled points are to the buest guess. Higherkappa
means more exploration of the parameter space, lowerkappa
means more exploitation.After each sample, the parameter
kappa
is reduced by the factorkappa_decay
. By default, this parameter is set such thatkappa
equalskappa_min
at the last step. This way, the sampling becomes more exploitative the more steps are taken.
The controller tracks the improvements of the buest guess for parameters. If
min_improvement_count
consecutive improvements are lower thanmin_improvement
, the smart sampling is stopped. In this case, theiterations
count is increased and the process repeated from step 1.If an entire iteration did not show any improvement, the optimization is stopped. It is also stopped when the
max_iterations
count is reached.
Users can control the “density”, and thus the accuracy of the sampling by adjusting the controller parameters.
Increasing init_points
, n_iter
, min_improvement_count
, and max_iterations
, and decreasing min_improvement
generally increases density and accuracy, but leads to longer runtimes.
We suggest using the from_input
classmethod for a convenient choice of sampling density based on the parameter space.
The two parameters init_points
and n_iter
are set to \(b^N\), where \(N\) is the number of estimated parameters and \(b\) is the sampling_base
parameter, which defaults to 4.
Now we can finally execute our calibration task!
We will plug all input parameters in an instance of Input
, and then create the optimizer instance with it.
The Optimizer.run
method returns an Output
object, whose params
attribute holds the optimal parameters determined by the calibration.
Notice that the BayesianOptimization
maximizes a target function.
Therefore, higher target values are better than lower ones in this case.
from climada.util.calibrate import Input, BayesianOptimizer, BayesianOptimizerController
from sklearn.metrics import mean_squared_log_error
from climada.util import log_level
# Define calibration input
with log_level("INFO", name_prefix="climada.util.calibrate"):
input = Input(
hazard=hazard,
exposure=exposure,
data=data,
impact_func_creator=impact_func_tc,
cost_func=mean_squared_log_error,
impact_to_dataframe=lambda imp: imp.impact_at_reg(exposure.gdf["region_id"]),
bounds=bounds,
)
# Create and run the optimizer
opt = BayesianOptimizer(input)
controller = BayesianOptimizerController.from_input(input)
bayes_output = opt.run(controller)
bayes_output.params # The optimal parameters
2024-04-29 14:09:45,569 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-04-29 14:09:46,990 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-04-29 14:09:49,255 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-04-29 14:09:50,873 - climada.util.calibrate.bayesian_optimizer - INFO - Minimal improvement. Stop iteration.
2024-04-29 14:09:50,874 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 3
2024-04-29 14:09:52,698 - climada.util.calibrate.bayesian_optimizer - INFO - Minimal improvement. Stop iteration.
2024-04-29 14:09:52,699 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 4
2024-04-29 14:09:55,673 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 5
2024-04-29 14:09:58,819 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 6
2024-04-29 14:10:02,072 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 7
2024-04-29 14:10:06,238 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
Evaluate Output#
The Bayesian Optimizer returns the entire paramter space it sampled via BayesianOptimizerOutput
We can find out a lot about the relation of the fitted parameters by investigating how the cost function value depends on them.
We can retrieve the parameter space as pandas.DataFrame
via p_space_to_dataframe()
.
This dataframe has MultiIndex columns.
One group are the Parameters
, the other holds information on the Calibration
for each parameter set.
Notice that the optimal parameter set is not necessarily the last entry in the parameter space!
p_space_df = bayes_output.p_space_to_dataframe()
p_space_df
Parameters | Calibration | ||
---|---|---|---|
scale | v_half | Cost Function | |
Iteration | |||
0 | 0.422852 | 115.264302 | 2.726950 |
1 | 0.010113 | 63.349706 | 4.133135 |
2 | 0.155288 | 37.268453 | 0.800611 |
3 | 0.194398 | 68.718642 | 1.683610 |
4 | 0.402800 | 92.721038 | 2.046407 |
... | ... | ... | ... |
246 | 0.790590 | 50.164555 | 0.765166 |
247 | 0.788425 | 48.043636 | 0.768636 |
248 | 0.826704 | 49.932348 | 0.764991 |
249 | 0.880736 | 49.290532 | 0.767437 |
250 | 0.744523 | 51.596642 | 0.770761 |
251 rows × 3 columns
In contrast, the controller only tracks the consecutive improvements of the best guess.
controller.improvements()
iteration | random | target | improvement | |
---|---|---|---|---|
sample | ||||
0 | 0 | True | -2.726950 | inf |
2 | 0 | True | -0.800611 | 2.406088 |
23 | 0 | False | -0.799484 | 0.001409 |
25 | 0 | False | -0.794119 | 0.006756 |
29 | 0 | False | -0.791723 | 0.003026 |
40 | 1 | False | -0.781115 | 0.013581 |
44 | 1 | False | -0.777910 | 0.004119 |
55 | 1 | False | -0.772557 | 0.006929 |
88 | 2 | False | -0.768626 | 0.005115 |
92 | 2 | False | -0.768494 | 0.000172 |
93 | 2 | False | -0.768144 | 0.000456 |
113 | 3 | False | -0.767889 | 0.000333 |
115 | 3 | False | -0.767105 | 0.001021 |
116 | 3 | False | -0.766561 | 0.000709 |
122 | 3 | False | -0.766400 | 0.000211 |
142 | 4 | False | -0.765210 | 0.001556 |
146 | 4 | False | -0.764947 | 0.000343 |
173 | 5 | False | -0.764000 | 0.001240 |
216 | 6 | False | -0.763936 | 0.000084 |
To get one group of the columns, simply access the item.
p_space_df["Parameters"]
scale | v_half | |
---|---|---|
Iteration | ||
0 | 0.422852 | 115.264302 |
1 | 0.010113 | 63.349706 |
2 | 0.155288 | 37.268453 |
3 | 0.194398 | 68.718642 |
4 | 0.402800 | 92.721038 |
... | ... | ... |
246 | 0.790590 | 50.164555 |
247 | 0.788425 | 48.043636 |
248 | 0.826704 | 49.932348 |
249 | 0.880736 | 49.290532 |
250 | 0.744523 | 51.596642 |
251 rows × 2 columns
Notice that the optimal parameter set is not necessarily the last entry in the parameter space! Therefore, let’s order the parameter space by the ascending cost function values.
p_space_df = p_space_df.sort_values(("Calibration", "Cost Function"), ascending=True)
p_space_df
Parameters | Calibration | ||
---|---|---|---|
scale | v_half | Cost Function | |
Iteration | |||
216 | 0.993744 | 52.255606 | 0.763936 |
173 | 0.985721 | 52.026516 | 0.764000 |
177 | 0.881143 | 51.558665 | 0.764718 |
146 | 0.967284 | 51.034702 | 0.764947 |
248 | 0.826704 | 49.932348 | 0.764991 |
... | ... | ... | ... |
35 | 0.028105 | 118.967924 | 6.298332 |
95 | 0.012842 | 102.449398 | 6.635728 |
199 | 0.031310 | 143.537900 | 7.262185 |
99 | 0.025663 | 141.236104 | 7.504095 |
232 | 0.010398 | 147.113486 | 9.453765 |
251 rows × 3 columns
The BayesianOptimizerOutput
supplies the plot_p_space()
method for convenience.
If there were more than two parameters we calibrated, it would produce a plot for each parameter combination.
bayes_output.plot_p_space(x="v_half", y="scale")
<Axes: xlabel='(Parameters, v_half)', ylabel='(Parameters, scale)'>
p_space_df["Parameters"].iloc[0, :].to_dict()
{'scale': 0.9937435057717706, 'v_half': 52.25560598796059}
Analyze the Calibration#
Now that we obtained a calibration result, we should investigate it further.
The tasks of evaluating results and plotting them is simplified by the OutputEvaluator
.
It takes the input and output of a calibration task as parameters.
Let’s start by plotting the optimized impact function:
from climada.util.calibrate import OutputEvaluator
output_eval = OutputEvaluator(input, bayes_output)
output_eval.impf_set.plot()
<Axes: title={'center': 'TC 1: Emanuel 2011'}, xlabel='Intensity (m/s)', ylabel='Impact (%)'>
Here we show how the variability in parameter combinations with similar cost function values (as seen in the plot of the parameter space) translate to varying impact functions. In addition, the hazard value distribution is shown. Together this provides an intuitive overview regarding the robustness of the optimization, given the chosen cost function. It does NOT provide a view of the sampling uncertainty (as e.g. bootstrapping or cross-validation) NOR of the suitability of the cost function which is chosen by the user.
This functionality is only available from the BayesianOptimizerOutputEvaluator
tailored to Bayesian optimizer outputs.
It includes all function from OutputEvaluator
.
from climada.util.calibrate import BayesianOptimizerOutputEvaluator, select_best
output_eval = BayesianOptimizerOutputEvaluator(input, bayes_output)
# Plot the impact function variability
output_eval.plot_impf_variability(select_best(p_space_df, 0.03), plot_haz=False)
output_eval.plot_impf_variability(select_best(p_space_df, 0.03))
<Axes: xlabel='Intensity (m/s)', ylabel='Mean Damage Ratio (MDR)'>
The target function has limited meaning outside the calibration task.
To investigate the quality of the calibration, it is helpful to compute the impact with the impact function defined by the optimal parameters.
The OutputEvaluator
readily computed this impact when it was created.
You can access the impact via the impact
attribute.
import numpy as np
impact_data = output_eval.impact.impact_at_reg(exposure.gdf["region_id"])
impact_data.set_index(np.asarray(hazard.event_name), inplace=True)
impact_data
28 | 44 | 92 | 132 | 192 | 212 | 214 | 388 | 484 | 630 | 659 | 662 | 670 | 796 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010176N16278 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.641194e+09 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2010236N12341 | 2.382896e+07 | 0.000000e+00 | 6.319901e+07 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 7.558366e+07 | 1.253981e+07 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2010257N16282 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.292543e+07 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2010302N09306 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.727897e+06 | 4.578844e+06 | 6.995127e+05 |
2011233N15301 | 0.000000e+00 | 1.230474e+09 | 2.364701e+07 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 7.109156e+04 | 0.000000e+00 | 0.000000e+00 | 7.094676e+08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.383507e+08 |
2011279N10257 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.718903e+06 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2012215N12313 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.577538e+08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2012166N09269 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.901566e+08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2012296N14283 | 0.000000e+00 | 1.323275e+08 | 0.000000e+00 | 0.000000 | 2.751122e+09 | 0.000000e+00 | 0.000000e+00 | 3.101404e+09 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2014253N13260 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.352591e+10 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2015293N13266 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.440247e+08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2015242N12343 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 227605.609803 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2015270N27291 | 0.000000e+00 | 8.433712e+05 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2016248N15255 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.127676e+09 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2017219N16279 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.101261e+07 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
2017242N16333 | 3.750206e+08 | 1.354638e+06 | 4.684316e+08 | 0.000000 | 3.369390e+09 | 0.000000e+00 | 2.693964e+07 | 0.000000e+00 | 0.000000e+00 | 1.172531e+10 | 4.040075e+08 | 0.000000e+00 | 0.000000e+00 | 8.060133e+08 |
2017260N12310 | 0.000000e+00 | 0.000000e+00 | 3.413228e+07 | 0.000000 | 0.000000e+00 | 6.646546e+08 | 1.189164e+08 | 0.000000e+00 | 0.000000e+00 | 8.918857e+10 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.235309e+07 |
We can now compare the modelled and reported impact data on a country- or event-basis.
The OutputEvaluator
also has methods for that.
In both of these, you can supply a transformation function with the data_transf
argument.
This transforms the data to be plotted right before plotting.
Recall that we set the event IDs as index for the data frames.
To better interpret the results, it is useful to transform them into event names again, which are the IBTrACS IDs.
Likewise, we use the region IDs for region identification.
It might be nicer to transform these into country names before plotting.
import climada.util.coordinates as u_coord
def country_code_to_name(code):
return u_coord.country_to_iso(code, representation="name")
event_id_to_name = {
hazard.event_id[idx]: hazard.event_name[idx] for idx in range(len(hazard.event_id))
}
output_eval.plot_at_event(
data_transf=lambda x: x.rename(index=event_id_to_name), logy=True
)
output_eval.plot_at_region(
data_transf=lambda x: x.rename(index=country_code_to_name), logy=True
)
<Axes: ylabel='Impact [USD]'>
Finally, we can do an event- and country-based comparison using a heatmap using plot_event_region_heatmap()
Since the magnitude of the impact values may differ strongly, this method compare them on a logarithmic scale.
It divides each modelled impact by the observed impact and takes the the decadic logarithm.
The result will tell us how many orders of magnitude our model was off.
Again, the considerations for “nicer” index and columns apply.
output_eval.plot_event_region_heatmap(
data_transf=lambda x: x.rename(index=event_id_to_name, columns=country_code_to_name)
)
<Axes: >
Handling Missing Data#
NaN-valued input data has a special meaning in calibration: It implies that the impact calculated by the model for that data point should be ignored. Opposed to that, a value of zero indicates that the model should be calibrated towards an impact of exactly zero for this data point.
There might be instances where data is provided for a certain region or event, and the model produces no impact for them.
This is always treated as an impact of zero during the calibration.
Likewise, there might be instances where the model computes an impact for a region for which no impact data is available.
In these cases, missing_data_value
of Input
is used as fill value for the data.
According to the data value logic mentioned above, missing_data_value=np.nan
(the default setting) will cause the modeled impact to be ignored in the calibration.
Setting missing_data_value=0
, on the other hand, will calibrate the model towards zero impact for all regions or events where no data is supplied.
Let’s exemplify this with a subset of the data used for the last calibration.
Irma is the cyclone with ID 2017242N16333
, and it hit most locations we were looking at before.
It has the event ID 1454
in our setup.
Let’s just use this hazard and the related data for now.
NOTE: We must pass a dataframe to the input, but selecting a single row or column will return a Series. We expand it into a dataframe again and make sure the new frame is oriented the correct way.
hazard_irma = all_tcs.select(event_names=["2017242N16333"])
data_irma = data.loc[1454, :].to_frame().T
data_irma
region_id | 28 | 44 | 92 | 132 | 192 | 212 | 214 | 388 | 484 | 630 | 659 | 662 | 670 | 796 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1454 | 211176356.8 | 1801876.321 | 3.000000e+09 | NaN | 1.099275e+10 | NaN | NaN | NaN | NaN | NaN | 18484889.46 | NaN | NaN | 500000000.0 |
Let’s first calibrate the impact function only on this event, including all data we have.
from climada.util.calibrate import (
Input,
BayesianOptimizer,
OutputEvaluator,
BayesianOptimizerOutputEvaluator,
)
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
def calibrate(hazard, data, **input_kwargs):
"""Calibrate using custom hazard and data"""
# Define calibration input
input = Input(
hazard=hazard,
exposure=exposure,
data=data,
impact_func_creator=impact_func_tc,
cost_func=mean_squared_log_error,
impact_to_dataframe=lambda imp: imp.impact_at_reg(exposure.gdf["region_id"]),
bounds=bounds,
**input_kwargs,
)
# Create and run the optimizer
with log_level("INFO", name_prefix="climada.util.calibrate"):
opt = BayesianOptimizer(input)
controller = BayesianOptimizerController.from_input(input)
bayes_output = opt.run(controller)
# Evaluate output
output_eval = OutputEvaluator(input, bayes_output)
output_eval.impf_set.plot()
plt.figure() # New figure because seaborn.heatmap draws into active axes
output_eval.plot_event_region_heatmap(
data_transf=lambda x: x.rename(
index=event_id_to_name, columns=country_code_to_name
)
)
return bayes_output.params # The optimal parameters
param_irma = calibrate(hazard_irma, data_irma)
2024-02-20 13:27:01,235 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-02-20 13:27:02,398 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-02-20 13:27:03,922 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-02-20 13:27:05,573 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
If we now remove some of the damage reports and repeat the calibration, the respective impact computed by the model will be ignored.
For Saint Kitts and Nevis, and for Turks and the Caicos Islands, the impact is overestimated by the model.
Removing these regions from the estimation should shift the estimated parameters accordingly, because by default, impacts for missing data points are ignored with missing_data_value=np.nan
.
calibrate(hazard_irma, data_irma.drop(columns=[659, 796]))
2024-02-20 13:27:12,398 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-02-20 13:27:13,605 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-02-20 13:27:14,991 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 2
2024-02-20 13:27:16,771 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 3
2024-02-20 13:27:18,916 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
{'scale': 0.9893201575415296, 'v_half': 46.03113797418196}
However, the calibration should change into the other direction once we require the modeled impact at missing data points to be zero:
calibrate(hazard_irma, data_irma.drop(columns=[659, 796]), missing_data_value=0)
2024-02-20 13:27:25,785 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 0
2024-02-20 13:27:26,845 - climada.util.calibrate.bayesian_optimizer - INFO - Minimal improvement. Stop iteration.
2024-02-20 13:27:26,845 - climada.util.calibrate.bayesian_optimizer - INFO - Optimization iteration: 1
2024-02-20 13:27:28,241 - climada.util.calibrate.bayesian_optimizer - INFO - No improvement. Stop optimization.
{'scale': 0.07120177027571553, 'v_half': 134.9655530108171}
Actively requiring that the model calibrates towards zero impact in the two dropped regions means that it will typically strongly overestimate the impact there (because impact actually took place). This will “flatten” the vulnerability curve, causing strong underestimation in the other regions.
How to Continue#
While the found impact function looks reasonable, we find that the model overestimates the impact severely for several events. This might be due to missing data, but it is also strongly related to the choice of impact function (shape) and the particular goal of the calibration task.
The most crucial information in the calibration task is the cost function. The RMSE measure is sensitive to the largest errors (and hence the largest impact). Therefore, using it in the calibration minimizes the overall error, but will incorrectly capture events with impact of lower orders of magnitude. Using a cost function based on the ratio between modelled and observed impact might increase the overall error but decrease the log-error for many events.
So we present some ideas on how to continue and/or improve the calibration:
Run the calibration again, but change the number of initial steps and/or iteration steps.
Use a different cost function, e.g., an error measure based on a ratio rather than a difference.
Also calibrate the
v_thresh
parameter. This requires adding constraints, becausev_thresh
<v_half
.Calibrate different impact functions for houses in Mexico and Puerto Rico within the same optimization task.
Employ the
ScipyMinimizeOptimizer
instead of theBayesianOptimizer
.