This notebook demonstrates orchestration and harmonization across multiple sources to produce a single, modelling-ready CSV file compatible with the DHIS2 Chap Modeling Platform.
Scope (important):

- We consume harmonized outputs produced by the earlier workflow notebooks (OpenDengue, WorldPop, ERA5, CHIRPS3).
- We do not repeat source-specific harmonization choices here (e.g. filtering mixed resolutions in OpenDengue).
- Our job in this notebook is to align all inputs onto a shared modelling grid (admin unit × monthly time_period), merge them, and export a single CSV.
- Missing values are preserved as NaN (no imputation).
Spatial harmonization¶
To merge heterogeneous sources into a single modelling table, we must choose a common spatial unit (the modelling geography) and express all inputs on that unit.
In this example, we use administrative units from data/nepal-locations.geojson as
the spatial spine. All datasets must map to these units before they can be merged.
| Dataset | Native spatial resolution | Harmonized resolution | Notes |
|---|---|---|---|
| OpenDengue | Mixed (Admin0/Admin1/Admin2) | Admin units (GeoJSON) | We consume the dengue-harmonized output from the dengue workflow, already mapped to the chosen admin level. |
| ERA5-Land | Regular grid (~0.1°) | Admin units (GeoJSON) | We consume the admin-level output from the ERA5 workflow (already reduced over polygons). |
| CHIRPS3 | Regular grid (~0.05°) | Admin units (GeoJSON) | We consume the admin-level output from the CHIRPS3 workflow (already reduced over polygons). |
| WorldPop | Raster grid (≈100m–1km, product dependent) | Admin units (GeoJSON) | We consume the admin-level output from the WorldPop workflow (aggregated by polygon sum). |
Spatial harmonization aligns all sources to a shared modelling geography; it does not increase the native spatial precision of any source.
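The polygon reductions themselves happen in the upstream workflow notebooks, but the core idea can be sketched with a toy cell-to-unit lookup (the locations, cell IDs, and values below are made up for illustration; in the real workflows the mapping comes from polygon overlays):

```python
import pandas as pd

# Hypothetical gridded values (e.g. one value per raster cell)
cells = pd.DataFrame({
    "cell_id": [1, 2, 3, 4],
    "value":   [10.0, 20.0, 30.0, 40.0],
})

# Hypothetical precomputed cell -> admin-unit lookup (from a polygon overlay)
lookup = pd.DataFrame({
    "cell_id":  [1, 2, 3, 4],
    "location": ["Bagmati", "Bagmati", "Gandaki", "Gandaki"],
})

# Reduce grid cells onto admin units (mean here; sum for counts like population)
admin = (
    cells.merge(lookup, on="cell_id")
    .groupby("location", as_index=False)["value"]
    .mean()
)
print(admin)
```

The choice of reduction (mean for temperature, sum for population, etc.) is made per dataset in the upstream notebooks.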
Temporal harmonization¶
To merge heterogeneous data sources into a single modelling table, we must choose a common temporal resolution (the modelling clock) and express all inputs on that axis.
In this example, we use a monthly time step.
| Dataset | Native temporal resolution | Harmonized resolution | Notes |
|---|---|---|---|
| OpenDengue | Weekly / irregular | Monthly | We consume the dengue-harmonized output from the dengue workflow, aggregated to one value per month and location. |
| ERA5-Land | Hourly / daily | Monthly | We consume daily (or monthly) admin-level outputs and aggregate to monthly where needed. |
| CHIRPS3 | Daily | Monthly | We consume daily admin-level outputs and aggregate to monthly where needed. |
| WorldPop | Yearly (static) | Monthly (expanded) | We consume the upstream output where yearly totals are aggregated to admin units and expanded to monthly. |
Temporal harmonization aligns all datasets on the same time axis; it does not increase the intrinsic temporal precision of any source. Static or sparsely sampled datasets remain static after alignment.
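As a standalone sketch (synthetic values, not part of the pipeline), aggregating a daily admin-level series to the monthly modelling clock looks like this:

```python
import pandas as pd

# Toy daily precipitation series for one admin unit
daily = pd.DataFrame({
    "location": ["Bagmati"] * 4,
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-02-01", "2020-02-02"]),
    "precip_mm": [1.0, 2.0, 3.0, 5.0],
})

# Derive the YYYYMM time_period key, then aggregate (sum for precipitation)
daily["time_period"] = daily["date"].dt.strftime("%Y%m")
monthly = daily.groupby(["location", "time_period"], as_index=False)["precip_mm"].sum()
print(monthly)
```

For a variable like temperature the aggregation would be a mean rather than a sum; static sources such as WorldPop are instead repeated across months.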
from pathlib import Path
import re
import pandas as pd
# ---------------------------------------------------------------------
# Parameters
# ---------------------------------------------------------------------
FREQ = "monthly"
# This notebook consumes harmonized outputs produced by earlier workflow notebooks.
DATA_DIR = Path("../data").resolve()
# Harmonized inputs (already aligned to the same orgunit and monthly time_period)
DENGUE_CSV = DATA_DIR / "nepal-dengue-monthly-admin.csv" # columns: location, time_period, disease_cases
POP_CSV = DATA_DIR / "nepal-worldpop-monthly-admin.csv" # columns: location, time_period, population
PRCP_CSV = DATA_DIR / "nepal-era5-prcp-monthly-admin.csv"  # columns: location, time_period, tp
T2M_CSV = DATA_DIR / "nepal-era5-t2m-monthly-admin.csv" # columns: location, time_period, t2m_c
# Output
OUTPUT_CSV = DATA_DIR / "nepal_dengue_pop_climate_chap.csv"
Helper utilities¶
Before merging sources, we apply a small set of checks to ensure all inputs already conform to the expected modelling contract (location, time_period, monthly resolution).
# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------
PERIOD_YYYYMM = re.compile(r"^\d{6}$")
def read_csv(path: Path, required: list[str]) -> pd.DataFrame:
if not path.exists():
raise FileNotFoundError(f"Missing input: {path}")
df = pd.read_csv(path)
# check for missing columns
missing = [c for c in required if c not in df.columns]
if missing:
raise KeyError(f"{path.name} missing {missing}. Found: {list(df.columns)}")
# normalize keys
df["location"] = df["location"].astype(str).str.strip()
df["time_period"] = (
df["time_period"].astype(str).str.replace(r"\D", "", regex=True).str.zfill(6)
)
# check period format
bad = ~df["time_period"].str.match(PERIOD_YYYYMM)
if bad.any():
raise ValueError(f"Invalid time_period (YYYYMM). Examples: {df.loc[bad, 'time_period'].head(5).tolist()}")
# drop missing location and periods
df = df.dropna(subset=["location", "time_period"]).copy()
# check for duplicates
if df.duplicated(["location", "time_period"]).any():
raise ValueError(f"{path.name} has duplicate (location, time_period) rows")
return df
def month_index(start_yyyymm: str, end_yyyymm: str) -> pd.Index:
return pd.period_range(start_yyyymm, end_yyyymm, freq="M").astype(str).str.replace("-", "", regex=False)
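As a quick standalone illustration of what `month_index` returns, the same transformation can be run on a short range (example months chosen arbitrarily):

```python
import pandas as pd

# Inclusive range of YYYYMM strings between two months,
# mirroring the month_index helper above.
months = (
    pd.period_range("2020-01", "2020-03", freq="M")
    .astype(str)
    .str.replace("-", "", regex=False)
)
print(list(months))  # ['202001', '202002', '202003']
```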
Load inputs (harmonized outputs)¶
At this point, the earlier workflow notebooks should have produced:
- a dengue table at monthly resolution (location, time_period, disease_cases)
- an admin-level WorldPop table (location, population)
- admin-level climate time series for ERA5 (and optionally CHIRPS3)
# Load harmonized inputs (already monthly, already on the admin-unit spine)
dengue = read_csv(DENGUE_CSV, ["location", "time_period", "disease_cases"])
pop = read_csv(POP_CSV, ["location", "time_period", "population"])
prcp = read_csv(PRCP_CSV, ["location", "time_period", "tp"])
t2m = read_csv(T2M_CSV, ["location", "time_period", "t2m_c"])
# Coerce numeric columns (preserve missing values as NaN)
dengue["disease_cases"] = pd.to_numeric(dengue["disease_cases"], errors="coerce")
pop["population"] = pd.to_numeric(pop["population"], errors="coerce")
prcp["tp"] = pd.to_numeric(prcp["tp"], errors="coerce")
t2m["t2m_c"] = pd.to_numeric(t2m["t2m_c"], errors="coerce")
# Quick shape summary
summary = pd.DataFrame({
"rows": [len(dengue), len(pop), len(prcp), len(t2m)],
"locations": [dengue["location"].nunique(), pop["location"].nunique(), prcp["location"].nunique(), t2m["location"].nunique()],
"months": [dengue["time_period"].nunique(), pop["time_period"].nunique(), prcp["time_period"].nunique(), t2m["time_period"].nunique()],
}, index=["dengue", "population", "precip", "t2m"])
summary
Build the modelling grid (admin unit × monthly time_period)¶
We use the dengue time range as the default modelling window, since dengue is the target outcome.
All locations from the spatial spine are included. Missing values are preserved as NaN.
# Determine global modelling window from dengue (target outcome)
start_tp = dengue["time_period"].min()
end_tp = dengue["time_period"].max()
months = month_index(start_tp, end_tp)
# Spatial spine: union of all locations present across harmonized inputs
locations = pd.Index(
pd.unique(pd.concat([
dengue["location"],
pop["location"],
prcp["location"],
t2m["location"],
], ignore_index=True)).astype(str),
name="location"
).sort_values()
# Full modelling grid: all locations × all months
grid = pd.MultiIndex.from_product([locations, months], names=["location", "time_period"]).to_frame(index=False)
grid
Merge all sources onto the modelling grid¶
We join everything onto the modelling grid using (location, time_period).
- Dengue: monthly outcome (disease_cases)
- Population: monthly covariate (population)
- Climate: monthly covariates (tp, t2m_c)
Missing values are preserved as NaN (no imputation).
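A toy example (synthetic locations and counts) shows why left-joining onto the grid preserves gaps as NaN instead of dropping rows:

```python
import pandas as pd

# Full modelling grid: every (location, month) combination
grid = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "time_period": ["202001", "202002", "202001", "202002"],
})

# Sparse outcome table: only some (location, month) pairs have records
cases = pd.DataFrame({
    "location": ["A", "B"],
    "time_period": ["202001", "202001"],
    "disease_cases": [5, 7],
})

# Left join keeps all grid rows; unmatched rows get NaN for disease_cases
merged = grid.merge(cases, on=["location", "time_period"], how="left")
print(merged)
```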
# Start from the grid
df = grid.copy()
# Outcome + covariates (all keyed by location + time_period)
df = df.merge(dengue, on=["location", "time_period"], how="left")
df = df.merge(pop, on=["location", "time_period"], how="left")
df = df.merge(prcp, on=["location", "time_period"], how="left")
df = df.merge(t2m, on=["location", "time_period"], how="left")
df
Structural sanity checks¶
We avoid deep data validation in this guide. These checks confirm that the merged table satisfies the structural assumptions expected by downstream modelling (in our case Chap).
# One row per (location, time_period)
assert not df.duplicated(["location", "time_period"]).any(), "Duplicate (location, time_period) rows found."
# Required Chap fields present
required_cols = {"location", "time_period", "disease_cases"}
missing = required_cols - set(df.columns)
assert not missing, f"Missing required columns: {missing}"
# Missingness summary (expected; no thresholds enforced here)
print('Missing values (share of total, 0-1):')
df.isna().mean().sort_values(ascending=False).head(20)
Missing values (share of total, 0-1):
location 0.0
time_period 0.0
disease_cases 0.0
population 0.0
tp 0.0
t2m_c 0.0
dtype: float64
Export to a Chap-compatible CSV¶
We use the Chap CSV exporter from dhis2eo. Reserved fields:
- reserved fields: time_period, location, disease_cases
- optional reserved field: population
All other columns are treated as covariates.
from dhis2eo.integrations.chap import dataframe_to_chap_csv
column_map = {
"time_period": "time_period",
"location": "location",
"disease_cases": "disease_cases",
"population": "population", # optional but recommended
}
dataframe_to_chap_csv(
df=df,
output_path=OUTPUT_CSV,
freq="monthly",
column_map=column_map,
)
OUTPUT_CSV
WindowsPath('C:/Users/karimba/Documents/Github/climate-tools/docs/guides/data/nepal_dengue_pop_climate_chap.csv')
To inspect the contents of the final CSV file:
df_chap = pd.read_csv(OUTPUT_CSV)
df_chap