This guide demonstrates how to download and harmonize OpenDengue case data for use with DHIS2. The same approach can also be applied to local dengue case counts from official Ministry of Health data.
The notebook focuses on data harmonization and preparation using a worked example for Nepal (districts / admin2) and monthly data.
Inputs¶
This workflow expects two local input files under ../../data/:
nepal-opendengue.csv — OpenDengue export containing Nepal dengue case counts
nepal-locations.geojson — Nepal district organisation units (admin2)
Output¶
The workflow produces:
nepal-dengue-monthly-admin.csv — harmonized monthly dengue cases per district (time_period, location, disease_cases)
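For orientation, the harmonized file is a long-format table with one row per district and month, written with the header
time_period,location,disease_cases
where time_period is a YYYY-MM month label, location is the DHIS2 organisation unit UID taken from the GeoJSON, and disease_cases is an integer case count.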
from pathlib import Path
import pandas as pd
import geopandas as gpd
pd.set_option("display.max_columns", 200)
Paths¶
DATA_FOLDER = Path("../../data")
LOCATIONS_GEOJSON = DATA_FOLDER / "nepal-locations.geojson"
OPENDENGUE_SOURCE_PATH = DATA_FOLDER / "nepal-opendengue.csv"
# Output
OUT_CSV = DATA_FOLDER / "nepal-dengue-monthly-admin.csv"
for p in [LOCATIONS_GEOJSON, OPENDENGUE_SOURCE_PATH]:
    if not p.exists():
        raise FileNotFoundError(f"Missing required input: {p}")
print("Using inputs:")
print(" -", LOCATIONS_GEOJSON)
print(" -", OPENDENGUE_SOURCE_PATH)
Using inputs:
- ..\..\data\nepal-locations.geojson
- ..\..\data\nepal-opendengue.csv
Load DHIS2 district locations¶
locations = gpd.read_file(LOCATIONS_GEOJSON)
# DHIS2 UID
uid_col = "id" if "id" in locations.columns else None
if uid_col is None:
raise KeyError(f"Expected DHIS2 UID in GeoJSON 'id'. Found: {list(locations.columns)}")
locations["location"] = locations[uid_col].astype(str).str.strip()
# Join helper (district name)
if "name" not in locations.columns:
raise KeyError(f"Expected district name in GeoJSON 'name'. Found: {list(locations.columns)}")
locations["district_name"] = (
locations["name"].astype(str)
.str.replace(r"^\s*\d+\s+", "", regex=True) # drop "101 that came with location names"
.str.upper()
.str.strip()
)
# Keep only what we need
locations = locations[["location", "district_name", "geometry"]].dropna(subset=["location"]).copy()
locations
Load OpenDengue¶
df_raw = pd.read_csv(OPENDENGUE_SOURCE_PATH)
print("Loaded:", OPENDENGUE_SOURCE_PATH)
print("Columns:", df_raw.columns.tolist())
df_raw.head()
Loaded: ..\..\data\nepal-opendengue.csv
Columns: ['adm_0_name', 'adm_1_name', 'adm_2_name', 'full_name', 'ISO_A0', 'FAO_GAUL_code', 'RNE_iso_code', 'IBGE_code', 'calendar_start_date', 'calendar_end_date', 'Year', 'dengue_total', 'case_definition_standardised', 'S_res', 'T_res', 'UUID', 'region']
OpenDengue ships multiple administrative levels in the same file, so we keep only the Admin2 (district-level) rows.
df_raw = df_raw[df_raw['S_res']=='Admin2']
print('Number of rows after filtering to admin2 units:', len(df_raw))
Number of rows after filtering to admin2 units: 2772
Column mapping¶
# OpenDengue export columns
DATE_COL = "calendar_start_date"
CASES_COL = "dengue_total"
ADMIN2_COL = "adm_2_name"
missing = [c for c in [DATE_COL, CASES_COL, ADMIN2_COL] if c not in df_raw.columns]
if missing:
    raise KeyError(
        f"Input CSV is missing required columns: {missing}. "
        f"Available columns: {df_raw.columns.tolist()}"
    )
print("Using columns:", {"date": DATE_COL, "cases": CASES_COL, "admin2": ADMIN2_COL})Using columns: {'date': 'calendar_start_date', 'cases': 'dengue_total', 'admin2': 'adm_2_name'}
Normalize OpenDengue (Nepal districts)¶
df_norm = pd.DataFrame({
    "date": pd.to_datetime(df_raw[DATE_COL], errors="coerce"),
    "cases": pd.to_numeric(df_raw[CASES_COL], errors="coerce"),
    "district_name": df_raw[ADMIN2_COL],  # district name, not a DHIS2 location UID yet
})
# Normalize district name for the crosswalk join
df_norm["district_name"] = (
df_norm["district_name"]
.astype(str)
.str.upper()
.str.strip()
.str.replace(r"\s+", " ", regex=True)
)
# Map district_name -> DHIS2 orgUnit UID
df_norm = df_norm.merge(
    locations[["district_name", "location"]],
    on="district_name",
    how="left",
)
# Fail fast (or drop) if mapping is incomplete
unmapped = df_norm["location"].isna().mean()
print(f"Unmapped dengue rows: {unmapped:.2%}")
if unmapped > 0:
print("Examples:", df_norm.loc[df_norm["location"].isna(), "district_name"].drop_duplicates().head(15).tolist())
# Drop rows with missing values
df_norm = df_norm.dropna(subset=["location"]).copy()
df_norm = df_norm.dropna(subset=["date", "cases", "district_name"])
df_norm
Unmapped dengue rows: 1.30%
Examples: ['CHITAWAN']
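If the unmapped names are spelling variants rather than genuinely missing districts (here 'CHITAWAN' in OpenDengue, which most likely corresponds to the GeoJSON spelling 'CHITWAN'), a small alias map applied to df_norm["district_name"] before the merge above recovers those rows. This is a minimal sketch; the alias spelling is an assumption and should be checked against locations["district_name"].
# Hypothetical alias map (OpenDengue spelling -> GeoJSON spelling).
# Apply it before the district_name -> location merge so the rows are matched;
# verify each entry against locations["district_name"] first.
DISTRICT_ALIASES = {"CHITAWAN": "CHITWAN"}
df_norm["district_name"] = df_norm["district_name"].replace(DISTRICT_ALIASES)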
Monthly aggregation¶
# Convert to month period label (YYYY-MM)
df_norm["time_period"] = df_norm["date"].dt.to_period("M").astype(str)
# Aggregate within month + location
disease = (
    df_norm.groupby(["time_period", "location"], as_index=False)["cases"]
    .sum()
    .rename(columns={"cases": "disease_cases"})
)
print("Aggregated rows:", len(disease))
disease
Aggregated rows: 2736
Filter to districts and align time axis¶
# Keep only locations present in the GeoJSON backbone
before = len(disease)
disease = disease.merge(locations[["location"]], on="location", how="inner")
after = len(disease)
print(f"Backbone filter kept {after}/{before} rows")
# Build full (time_period x location) grid and fill missing with 0
all_months = pd.period_range(disease["time_period"].min(), disease["time_period"].max(), freq="M").astype(str)
all_locations = locations["location"].sort_values().unique()
grid = pd.MultiIndex.from_product([all_months, all_locations], names=["time_period", "location"]).to_frame(index=False)
disease_full = grid.merge(disease, on=["time_period", "location"], how="left")
disease_full["disease_cases"] = disease_full["disease_cases"].fillna(0)
# Keep integer-looking values as ints where possible
disease_full["disease_cases"] = pd.to_numeric(disease_full["disease_cases"], errors="coerce").fillna(0).astype(int)
print("Final rows (complete grid):", len(disease_full))
disease_full
Backbone filter kept 2736/2736 rows
Final rows (complete grid): 2772
Write output CSV¶
disease_full.to_csv(OUT_CSV, index=False)
print("Wrote:", OUT_CSV)
OUT_CSV
Wrote: ..\..\data\nepal-dengue-monthly-admin.csv
WindowsPath('../../data/nepal-dengue-monthly-admin.csv')
Next steps¶
This guide stops after downloading OpenDengue data and producing a harmonized, DHIS2-ready dataset for Nepal.
To import the resulting data into DHIS2, see the section for importing data to DHIS2.
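Before importing, a quick check that the written file matches the expected layout can catch problems early. This is a minimal sketch reusing OUT_CSV and the column names produced above.
check = pd.read_csv(OUT_CSV)
# Expect exactly the three harmonized columns and non-negative case counts
assert list(check.columns) == ["time_period", "location", "disease_cases"]
assert check["disease_cases"].ge(0).all()
print("Rows ready for import:", len(check))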