Harmonize climate and health data into a Chap-ready CSV file

This notebook demonstrates orchestration and harmonization across multiple sources to produce a single, modelling-ready CSV file compatible with the DHIS2 Chap Modeling Platform.

Scope (important):

  • Missing values are preserved as NaN (no imputation).

Spatial harmonization

To merge heterogeneous sources into a single modelling table, we must choose a common spatial unit (the modelling geography) and express all inputs on that unit.

In this example, we use administrative units from data/nepal-locations.geojson as the spatial spine. All datasets must map to these units before they can be merged.

| Dataset | Native spatial resolution | Harmonized resolution | Notes |
| --- | --- | --- | --- |
| OpenDengue | Mixed (Admin0/Admin1/Admin2) | Admin units (GeoJSON) | We consume the dengue-harmonized output from the dengue workflow, already mapped to the chosen admin level. |
| ERA5-Land | Regular grid (~0.1°) | Admin units (GeoJSON) | We consume the admin-level output from the ERA5 workflow (already reduced over polygons). |
| CHIRPS3 | Regular grid (~0.05°) | Admin units (GeoJSON) | We consume the admin-level output from the CHIRPS3 workflow (already reduced over polygons). |
| WorldPop | Raster grid (≈100 m–1 km, product dependent) | Admin units (GeoJSON) | We consume the admin-level output from the WorldPop workflow (aggregated by polygon sum). |

Spatial harmonization aligns all sources to a shared modelling geography; it does not increase the native spatial precision of any source.
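The "reduced over polygons" step happens in the upstream workflows, but its essence can be sketched with plain pandas. Here the cell-to-unit assignment is assumed to be precomputed (real workflows derive it from the GeoJSON polygons via zonal statistics); all unit names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical grid cells, each pre-assigned to the admin unit containing it
cells = pd.DataFrame({
    "cell_id": [0, 1, 2, 3],
    "admin_unit": ["Bagmati", "Bagmati", "Gandaki", "Gandaki"],
    "t2m_c": [18.0, 20.0, 15.0, 17.0],
})

# "Reduce over polygons": one value per admin unit (mean is typical for temperature)
admin_t2m = cells.groupby("admin_unit")["t2m_c"].mean()
```

The choice of reduction (mean for temperature, sum for population) depends on whether the variable is intensive or extensive.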

Temporal harmonization

To merge heterogeneous data sources into a single modelling table, we must choose a common temporal resolution (the modelling clock) and express all inputs on that axis.

In this example, we use a monthly time step.

| Dataset | Native temporal resolution | Harmonized resolution | Notes |
| --- | --- | --- | --- |
| OpenDengue | Weekly / irregular | Monthly | We consume the dengue-harmonized output from the dengue workflow, aggregated to one value per month and location. |
| ERA5-Land | Hourly / daily | Monthly | We consume daily (or monthly) admin-level outputs and aggregate to monthly where needed. |
| CHIRPS3 | Daily | Monthly | We consume daily admin-level outputs and aggregate to monthly where needed. |
| WorldPop | Yearly (static) | Monthly (expanded) | We consume the upstream output where yearly totals are aggregated to admin units and expanded to monthly. |

Temporal harmonization aligns all datasets on the same time axis; it does not increase the intrinsic temporal precision of any source. Static or sparsely sampled datasets remain static after alignment.
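The monthly aggregation performed upstream can be sketched in pandas; the daily series and column name below are hypothetical stand-ins for one admin unit's data:

```python
import pandas as pd

# Hypothetical daily precipitation series for one admin unit (1 mm every day)
daily = pd.DataFrame({
    "date": pd.date_range("2015-01-01", "2015-03-31", freq="D"),
    "precip_mm": 1.0,
})

# Aggregate to monthly totals, then label each month as YYYYMM
monthly = (
    daily.set_index("date")["precip_mm"]
    .resample("MS").sum()
    .rename_axis("month")
    .reset_index()
)
monthly["time_period"] = monthly["month"].dt.strftime("%Y%m")
```

For precipitation a monthly sum is appropriate; for temperature the same pattern would use `.mean()` instead.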

from pathlib import Path
import re

import pandas as pd

# ---------------------------------------------------------------------
# Parameters
# ---------------------------------------------------------------------
FREQ = "monthly"

# This notebook consumes harmonized outputs produced by earlier workflow notebooks.
DATA_DIR = Path("../data").resolve()

# Harmonized inputs (already aligned to the same orgunit and monthly time_period)
DENGUE_CSV = DATA_DIR / "nepal-dengue-monthly-admin.csv" # columns: location, time_period, disease_cases
POP_CSV    = DATA_DIR / "nepal-worldpop-monthly-admin.csv"    # columns: location, time_period, population
PRCP_CSV   = DATA_DIR / "nepal-era5-prcp-monthly-admin.csv"   # columns: location, time_period, tp
T2M_CSV    = DATA_DIR / "nepal-era5-t2m-monthly-admin.csv"    # columns: location, time_period, t2m_c

# Output
OUTPUT_CSV = DATA_DIR / "nepal_dengue_pop_climate_chap.csv"

Helper utilities

Before merging sources, we apply a small set of checks to ensure all inputs already conform to the expected modelling contract (location, time_period, monthly resolution).

# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------

PERIOD_YYYYMM = re.compile(r"^\d{6}$")

def read_csv(path: Path, required: list[str]) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(f"Missing input: {path}")

    df = pd.read_csv(path)

    # check for missing columns
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"{path.name} missing {missing}. Found: {list(df.columns)}")

    # drop rows with missing keys first (astype(str) below would turn NaN into "nan")
    df = df.dropna(subset=["location", "time_period"]).copy()

    # normalize keys
    df["location"] = df["location"].astype(str).str.strip()
    df["time_period"] = (
        df["time_period"].astype(str).str.replace(r"\D", "", regex=True).str.zfill(6)
    )

    # check period format
    bad = ~df["time_period"].str.match(PERIOD_YYYYMM)
    if bad.any():
        raise ValueError(
            f"{path.name} has invalid time_period values (expected YYYYMM). "
            f"Examples: {df.loc[bad, 'time_period'].head(5).tolist()}"
        )

    # check for duplicates
    if df.duplicated(["location", "time_period"]).any():
        raise ValueError(f"{path.name} has duplicate (location, time_period) rows")

    return df

def month_index(start_yyyymm: str, end_yyyymm: str) -> pd.Index:
    return pd.period_range(start_yyyymm, end_yyyymm, freq="M").astype(str).str.replace("-", "", regex=False)
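As a quick sanity check, `month_index` produces contiguous YYYYMM labels even across a year boundary (the helper is repeated here so the snippet is self-contained):

```python
import pandas as pd

def month_index(start_yyyymm: str, end_yyyymm: str) -> pd.Index:
    # Same helper as above: contiguous monthly periods rendered as YYYYMM strings
    return pd.period_range(start_yyyymm, end_yyyymm, freq="M").astype(str).str.replace("-", "", regex=False)

months = month_index("202311", "202402")
```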

Load inputs (harmonized outputs)

At this point, the earlier workflow notebooks should have produced:

  • a dengue table at monthly resolution (location, time_period, disease_cases)

  • an admin-level WorldPop table expanded to monthly (location, time_period, population)

  • admin-level climate time series for ERA5 (and optionally CHIRPS3)

# Load harmonized inputs (already monthly, already on the admin-unit spine)
dengue = read_csv(DENGUE_CSV, ["location", "time_period", "disease_cases"])
pop    = read_csv(POP_CSV,    ["location", "time_period", "population"])
prcp   = read_csv(PRCP_CSV,   ["location", "time_period", "tp"])
t2m    = read_csv(T2M_CSV,    ["location", "time_period", "t2m_c"])

# Coerce numeric columns (preserve missing values as NaN)
dengue["disease_cases"] = pd.to_numeric(dengue["disease_cases"], errors="coerce")
pop["population"]       = pd.to_numeric(pop["population"], errors="coerce")
prcp["tp"]              = pd.to_numeric(prcp["tp"], errors="coerce")
t2m["t2m_c"]            = pd.to_numeric(t2m["t2m_c"], errors="coerce")

# Quick shape summary
summary = pd.DataFrame({
    "rows": [len(dengue), len(pop), len(prcp), len(t2m)],
    "locations": [dengue["location"].nunique(), pop["location"].nunique(), prcp["location"].nunique(), t2m["location"].nunique()],
    "months": [dengue["time_period"].nunique(), pop["time_period"].nunique(), prcp["time_period"].nunique(), t2m["time_period"].nunique()],
}, index=["dengue", "population", "precip", "t2m"])

summary

Build the modelling grid (admin unit × monthly time_period)

We use the dengue time range as the default modelling window, since dengue is the target outcome.

All locations from the spatial spine are included. Missing values are preserved as NaN.

# Determine global modelling window from dengue (target outcome)
start_tp = dengue["time_period"].min()
end_tp = dengue["time_period"].max()

months = month_index(start_tp, end_tp)

# Spatial spine: union of all locations present across harmonized inputs
locations = pd.Index(
    pd.unique(pd.concat([
        dengue["location"],
        pop["location"],
        prcp["location"],
        t2m["location"],
    ], ignore_index=True)).astype(str),
    name="location"
).sort_values()

# Full modelling grid: all locations × all months
grid = pd.MultiIndex.from_product([locations, months], names=["location", "time_period"]).to_frame(index=False)

grid
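On a toy example (hypothetical unit names), the grid construction gives exactly one row per (location, month) pair:

```python
import pandas as pd

# Toy spine: two hypothetical admin units x three months
locations = pd.Index(["Bagmati", "Gandaki"], name="location")
months = pd.Index(["202301", "202302", "202303"], name="time_period")

grid = pd.MultiIndex.from_product(
    [locations, months], names=["location", "time_period"]
).to_frame(index=False)
```

Because the grid is the Cartesian product of the two indexes, its length is always `len(locations) * len(months)`, with no duplicates by construction.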

Merge all sources onto the modelling grid

We join everything onto the modelling grid using (location, time_period).

  • Dengue: monthly outcome (disease_cases)

  • Population: monthly covariate (population)

  • Climate: monthly covariates (tp, t2m_c)

Missing values are preserved as NaN (no imputation).

# Start from the grid
df = grid.copy()

# Outcome + covariates (all keyed by location + time_period)
df = df.merge(dengue, on=["location", "time_period"], how="left")
df = df.merge(pop,    on=["location", "time_period"], how="left")
df = df.merge(prcp,   on=["location", "time_period"], how="left")
df = df.merge(t2m,    on=["location", "time_period"], how="left")

df
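A useful diagnostic before modelling is to check how much of the grid each source actually covered. pandas' merge `indicator` option makes this straightforward; the data here are toy values, not the real inputs:

```python
import pandas as pd

# Toy modelling grid and a source that covers only 3 of its 4 cells
grid = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "time_period": ["202301", "202302", "202301", "202302"],
})
dengue = pd.DataFrame({
    "location": ["A", "A", "B"],
    "time_period": ["202301", "202302", "202301"],
    "disease_cases": [3, 5, 2],
})

# indicator=True adds a _merge column saying which side each row came from
checked = grid.merge(dengue, on=["location", "time_period"], how="left", indicator=True)
coverage = (checked["_merge"] == "both").mean()
```

Cells with no match stay NaN, consistent with the no-imputation policy above.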

Structural sanity checks

We avoid deep data validation in this guide. These checks confirm that the merged table satisfies the structural assumptions expected by downstream modelling (in our case Chap).

# One row per (location, time_period)
assert not df.duplicated(["location", "time_period"]).any(), "Duplicate (location, time_period) rows found."

# Required Chap fields present
required_cols = {"location", "time_period", "disease_cases"}
missing = required_cols - set(df.columns)
assert not missing, f"Missing required columns: {missing}"

# Missingness summary (expected; no thresholds enforced here)
print('Missing values (share of total, 0-1):')
df.isna().mean().sort_values(ascending=False).head(20)
Missing values (share of total, 0-1):
location         0.0
time_period      0.0
disease_cases    0.0
population       0.0
tp               0.0
t2m_c            0.0
dtype: float64

Export to a Chap-compatible CSV

We use the Chap CSV exporter from dhis2eo. Reserved fields:

  • time_period, location, disease_cases

  • optional reserved field: population

All other columns are treated as covariates.

from dhis2eo.integrations.chap import dataframe_to_chap_csv

column_map = {
    "time_period": "time_period",
    "location": "location",
    "disease_cases": "disease_cases",
    "population": "population",   # optional but recommended
}

dataframe_to_chap_csv(
    df=df,
    output_path=OUTPUT_CSV,
    freq=FREQ,
    column_map=column_map,
)

OUTPUT_CSV
WindowsPath('C:/Users/karimba/Documents/Github/climate-tools/docs/guides/data/nepal_dengue_pop_climate_chap.csv')
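If `dhis2eo` is not available, a plain pandas export approximates the same layout: reserved Chap fields first, remaining columns as covariates. This is a fallback sketch, not the exporter's exact schema, and the data below are toy values:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy merged table standing in for `df`
df = pd.DataFrame({
    "time_period": ["202301", "202302"],
    "location": ["Bagmati", "Bagmati"],
    "t2m_c": [12.5, 14.0],
    "disease_cases": [12, 7],
    "population": [100000, 100000],
})

out = Path(tempfile.gettempdir()) / "chap_fallback_demo.csv"

# Reserved Chap fields first; everything else is treated as a covariate
reserved = ["time_period", "location", "disease_cases", "population"]
covariates = [c for c in df.columns if c not in reserved]
df[reserved + covariates].to_csv(out, index=False)
```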

To inspect the contents of the final CSV file:

df_chap = pd.read_csv(OUTPUT_CSV)
df_chap