Harmonize climate and health data into a Chap-ready CSV file

This notebook demonstrates orchestration and harmonization across multiple sources to produce a single, modelling-ready CSV file compatible with the DHIS2 Chap Modeling Platform.

Scope (important):

  • Missing values are preserved as NaN (no imputation).

Spatial harmonization

To merge heterogeneous sources into a single modelling table, we must choose a common spatial unit (the modelling geography) and express all inputs on that unit.

In this example, we use administrative units from data/nepal-locations.geojson as the spatial spine. All datasets must map to these units before they can be merged.

| Dataset | Native spatial resolution | Harmonized resolution | Notes |
| --- | --- | --- | --- |
| OpenDengue | Mixed (Admin0/Admin1/Admin2) | Admin units (GeoJSON) | We consume the dengue-harmonized output from the dengue workflow, already mapped to the chosen admin level. |
| ERA5-Land | Regular grid (~0.1°) | Admin units (GeoJSON) | We consume the admin-level output from the ERA5 workflow (already reduced over polygons). |
| CHIRPS3 | Regular grid (~0.05°) | Admin units (GeoJSON) | We consume the admin-level output from the CHIRPS3 workflow (already reduced over polygons). |
| WorldPop | Raster grid (≈100 m–1 km, product dependent) | Admin units (GeoJSON) | We consume the admin-level output from the WorldPop workflow (aggregated by polygon sum). |

Spatial harmonization aligns all sources to a shared modelling geography; it does not increase the native spatial precision of any source.
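The "reduced over polygons" step happens in the upstream workflows, but its essence can be sketched with plain pandas. Here the cell-to-unit assignment is assumed to be precomputed (real workflows derive it from the GeoJSON polygons via zonal statistics); all unit names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical grid cells, each pre-assigned to the admin unit containing it
cells = pd.DataFrame({
    "cell_id": [0, 1, 2, 3],
    "admin_unit": ["Bagmati", "Bagmati", "Gandaki", "Gandaki"],
    "t2m_c": [18.0, 20.0, 15.0, 17.0],
})

# "Reduce over polygons": one value per admin unit (mean is typical for temperature)
admin_t2m = cells.groupby("admin_unit")["t2m_c"].mean()
```

The choice of reduction (mean for temperature, sum for population) depends on whether the variable is intensive or extensive.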

Temporal harmonization

To merge heterogeneous data sources into a single modelling table, we must choose a common temporal resolution (the modelling clock) and express all inputs on that axis.

In this example, we use a monthly time step.

| Dataset | Native temporal resolution | Harmonized resolution | Notes |
| --- | --- | --- | --- |
| OpenDengue | Weekly / irregular | Monthly | We consume the dengue-harmonized output from the dengue workflow, aggregated to one value per month and location. |
| ERA5-Land | Hourly / daily | Monthly | We consume daily (or monthly) admin-level outputs and aggregate to monthly where needed. |
| CHIRPS3 | Daily | Monthly | We consume daily admin-level outputs and aggregate to monthly where needed. |
| WorldPop | Yearly (static) | Monthly (expanded) | We consume the upstream output where yearly totals are aggregated to admin units and expanded to monthly. |

Temporal harmonization aligns all datasets on the same time axis; it does not increase the intrinsic temporal precision of any source. Static or sparsely sampled datasets remain static after alignment.
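The monthly aggregation performed upstream can be sketched in pandas; the daily series and column name below are hypothetical stand-ins for one admin unit's data:

```python
import pandas as pd

# Hypothetical daily precipitation series for one admin unit (1 mm every day)
daily = pd.DataFrame({
    "date": pd.date_range("2015-01-01", "2015-03-31", freq="D"),
    "precip_mm": 1.0,
})

# Aggregate to monthly totals, then label each month as YYYYMM
monthly = (
    daily.set_index("date")["precip_mm"]
    .resample("MS").sum()
    .rename_axis("month")
    .reset_index()
)
monthly["time_period"] = monthly["month"].dt.strftime("%Y%m")
```

For precipitation a monthly sum is appropriate; for temperature the same pattern would use `.mean()` instead.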

from pathlib import Path
import re

import pandas as pd

# ---------------------------------------------------------------------
# Parameters
# ---------------------------------------------------------------------
FREQ = "monthly"

# This notebook consumes harmonized outputs produced by earlier workflow notebooks.
DATA_DIR = Path("../data").resolve()

# Harmonized inputs (already aligned to the same orgunit and monthly time_period)
DENGUE_CSV = DATA_DIR / "nepal-dengue-monthly-admin.csv" # columns: location, time_period, disease_cases
POP_CSV    = DATA_DIR / "nepal-worldpop-monthly-admin.csv"    # columns: location, time_period, population
PRCP_CSV   = DATA_DIR / "nepal-era5-prcp-monthly-admin.csv"   # columns: location, time_period, tp
T2M_CSV    = DATA_DIR / "nepal-era5-t2m-monthly-admin.csv"    # columns: location, time_period, t2m_c

# Output
OUTPUT_CSV = DATA_DIR / "nepal_dengue_pop_climate_chap.csv"

Helper utilities

Before merging sources, we apply a small set of checks to ensure all inputs already conform to the expected modelling contract (location, time_period, monthly resolution).

# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------

PERIOD_YYYYMM = re.compile(r"^\d{6}$")

def read_csv(path: Path, required: list[str]) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Missing input: {path}")

    df = pd.read_csv(path)

    # check for missing columns
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"{path.name} missing {missing}. Found: {list(df.columns)}")

    # drop rows with missing keys first (astype(str) below would turn NaN into "nan")
    df = df.dropna(subset=["location", "time_period"]).copy()

    # normalize keys
    df["location"] = df["location"].astype(str).str.strip()
    df["time_period"] = (
        df["time_period"].astype(str).str.replace(r"\D", "", regex=True).str.zfill(6)
    )

    # check period format
    bad = ~df["time_period"].str.match(PERIOD_YYYYMM)
    if bad.any():
        raise ValueError(
            f"{path.name} has invalid time_period values (expected YYYYMM). "
            f"Examples: {df.loc[bad, 'time_period'].head(5).tolist()}"
        )

    # check for duplicates
    if df.duplicated(["location", "time_period"]).any():
        raise ValueError(f"{path.name} has duplicate (location, time_period) rows")

    return df

def month_index(start_yyyymm: str, end_yyyymm: str) -> pd.Index:
    return pd.period_range(start_yyyymm, end_yyyymm, freq="M").astype(str).str.replace("-", "", regex=False)
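As a quick sanity check, `month_index` produces contiguous YYYYMM labels even across a year boundary (the helper is repeated here so the snippet is self-contained):

```python
import pandas as pd

def month_index(start_yyyymm: str, end_yyyymm: str) -> pd.Index:
    # Same helper as above: contiguous monthly periods rendered as YYYYMM strings
    return pd.period_range(start_yyyymm, end_yyyymm, freq="M").astype(str).str.replace("-", "", regex=False)

months = month_index("202311", "202402")
```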

Load inputs (harmonized outputs)

At this point, the earlier workflow notebooks should have produced:

  • a dengue table at monthly resolution (location, time_period, disease_cases)

  • an admin-level WorldPop table expanded to monthly (location, time_period, population)

  • admin-level climate time series for ERA5 (and optionally CHIRPS3)

# Load harmonized inputs (already monthly, already on the admin-unit spine)
dengue = read_csv(DENGUE_CSV, ["location", "time_period", "disease_cases"])
pop    = read_csv(POP_CSV,    ["location", "time_period", "population"])
prcp   = read_csv(PRCP_CSV,   ["location", "time_period", "tp"])
t2m    = read_csv(T2M_CSV,    ["location", "time_period", "t2m_c"])

# Coerce numeric columns (preserve missing values as NaN)
dengue["disease_cases"] = pd.to_numeric(dengue["disease_cases"], errors="coerce")
pop["population"]       = pd.to_numeric(pop["population"], errors="coerce")
prcp["tp"]              = pd.to_numeric(prcp["tp"], errors="coerce")
t2m["t2m_c"]            = pd.to_numeric(t2m["t2m_c"], errors="coerce")

# Quick shape summary
summary = pd.DataFrame({
    "rows": [len(dengue), len(pop), len(prcp), len(t2m)],
    "locations": [dengue["location"].nunique(), pop["location"].nunique(), prcp["location"].nunique(), t2m["location"].nunique()],
    "months": [dengue["time_period"].nunique(), pop["time_period"].nunique(), prcp["time_period"].nunique(), t2m["time_period"].nunique()],
}, index=["dengue", "population", "precip", "t2m"])

summary

Build the modelling grid (admin unit × monthly time_period)

We use the dengue time range as the default modelling window, since dengue is the target outcome.

All locations from the spatial spine are included. Missing values are preserved as NaN.

# Determine global modelling window from dengue (target outcome)
start_tp = dengue["time_period"].min()
end_tp = dengue["time_period"].max()

months = month_index(start_tp, end_tp)

# Spatial spine: union of all locations present across harmonized inputs
locations = pd.Index(
    pd.unique(pd.concat([
        dengue["location"],
        pop["location"],
        prcp["location"],
        t2m["location"],
    ], ignore_index=True)).astype(str),
    name="location"
).sort_values()

# Full modelling grid: all locations × all months
grid = pd.MultiIndex.from_product([locations, months], names=["location", "time_period"]).to_frame(index=False)

grid
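On a toy example (hypothetical unit names), the grid construction gives exactly one row per (location, month) pair:

```python
import pandas as pd

# Toy spine: two hypothetical admin units x three months
locations = pd.Index(["Bagmati", "Gandaki"], name="location")
months = pd.Index(["202301", "202302", "202303"], name="time_period")

grid = pd.MultiIndex.from_product(
    [locations, months], names=["location", "time_period"]
).to_frame(index=False)
```

Because the grid is the Cartesian product of the two indexes, its length is always `len(locations) * len(months)`, with no duplicates by construction.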

Merge all sources onto the modelling grid

We join everything onto the modelling grid using (location, time_period).

  • Dengue: monthly outcome (disease_cases)

  • Population: monthly covariate (population)

  • Climate: monthly covariates (tp, t2m_c)

Missing values are preserved as NaN (no imputation).

# Start from the grid
df = grid.copy()

# Outcome + covariates (all keyed by location + time_period)
df = df.merge(dengue, on=["location", "time_period"], how="left")
df = df.merge(pop,    on=["location", "time_period"], how="left")
df = df.merge(prcp,   on=["location", "time_period"], how="left")
df = df.merge(t2m,    on=["location", "time_period"], how="left")

df
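A useful diagnostic before modelling is to check how much of the grid each source actually covered. pandas' merge `indicator` option makes this straightforward; the data here are toy values, not the real inputs:

```python
import pandas as pd

# Toy modelling grid and a source that covers only 3 of its 4 cells
grid = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "time_period": ["202301", "202302", "202301", "202302"],
})
dengue = pd.DataFrame({
    "location": ["A", "A", "B"],
    "time_period": ["202301", "202302", "202301"],
    "disease_cases": [3, 5, 2],
})

# indicator=True adds a _merge column saying which side each row came from
checked = grid.merge(dengue, on=["location", "time_period"], how="left", indicator=True)
coverage = (checked["_merge"] == "both").mean()
```

Cells with no match stay NaN, consistent with the no-imputation policy above.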

Structural sanity checks

We avoid deep data validation in this guide. These checks confirm that the merged table satisfies the structural assumptions expected by downstream modelling (in our case Chap).

# One row per (location, time_period)
assert not df.duplicated(["location", "time_period"]).any(), "Duplicate (location, time_period) rows found."

# Required Chap fields present
required_cols = {"location", "time_period", "disease_cases"}
missing = required_cols - set(df.columns)
assert not missing, f"Missing required columns: {missing}"

# Missingness summary (expected; no thresholds enforced here)
print('Missing values (share of total, 0-1):')
df.isna().mean().sort_values(ascending=False).head(20)
Missing values (share of total, 0-1):
location         0.0
time_period      0.0
disease_cases    0.0
population       0.0
tp               0.0
t2m_c            0.0
dtype: float64

Export to a Chap-compatible CSV

We use the Chap CSV exporter from dhis2eo. Reserved fields:

  • time_period, location, disease_cases

  • optional reserved field: population

All other columns are treated as covariates.

from dhis2eo.integrations.chap import dataframe_to_chap_csv

column_map = {
    "time_period": "time_period",
    "location": "location",
    "disease_cases": "disease_cases",
    "population": "population",   # optional but recommended
}

dataframe_to_chap_csv(
    df=df,
    output_path=OUTPUT_CSV,
    freq=FREQ,
    column_map=column_map,
)

OUTPUT_CSV
WindowsPath('C:/Users/karimba/Documents/Github/climate-tools/docs/guides/data/nepal_dengue_pop_climate_chap.csv')
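If `dhis2eo` is not available, a plain pandas export approximates the same layout: reserved Chap fields first, remaining columns as covariates. This is a fallback sketch, not the exporter's exact schema, and the data below are toy values:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy merged table standing in for `df`
df = pd.DataFrame({
    "time_period": ["202301", "202302"],
    "location": ["Bagmati", "Bagmati"],
    "t2m_c": [12.5, 14.0],
    "disease_cases": [12, 7],
    "population": [100000, 100000],
})

out = Path(tempfile.gettempdir()) / "chap_fallback_demo.csv"

# Reserved Chap fields first; everything else is treated as a covariate
reserved = ["time_period", "location", "disease_cases", "population"]
covariates = [c for c in df.columns if c not in reserved]
df[reserved + covariates].to_csv(out, index=False)
```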

To inspect the contents of the final CSV file:

df_chap = pd.read_csv(OUTPUT_CSV)
df_chap