This guide demonstrates how to download and harmonize OpenDengue case data for use with DHIS2. The same approach can also be applied to local dengue case counts from official Ministry of Health data.
The notebook focuses on data harmonization and preparation using a worked example for Nepal (districts / admin2) and monthly data.
Inputs¶
This workflow expects two local input files under ../../data/:
nepal-opendengue.csv — OpenDengue export containing Nepal dengue case counts
nepal-locations.geojson — Nepal district organisation units (admin2)
Output¶
The workflow produces:
nepal-dengue-monthly-admin.csv — harmonized monthly dengue cases per district (time_period, location, disease_cases)
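For orientation, the harmonized file is a long-format table with one row per district and month, written with the header
time_period,location,disease_cases
where time_period is a YYYY-MM month label, location is the DHIS2 organisation unit UID taken from the GeoJSON, and disease_cases is an integer case count.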
from pathlib import Path
import pandas as pd
import geopandas as gpd
pd.set_option("display.max_columns", 200)
Paths¶
DATA_FOLDER = Path("../../data")
LOCATIONS_GEOJSON = DATA_FOLDER / "nepal-locations.geojson"
OPENDENGUE_SOURCE_PATH = DATA_FOLDER / "nepal-opendengue.csv"
# Output
OUT_CSV = DATA_FOLDER / "nepal-dengue-monthly-admin.csv"
for p in [LOCATIONS_GEOJSON, OPENDENGUE_SOURCE_PATH]:
    if not p.exists():
        raise FileNotFoundError(f"Missing required input: {p}")
print("Using inputs:")
print(" -", LOCATIONS_GEOJSON)
print(" -", OPENDENGUE_SOURCE_PATH)
Using inputs:
- ..\..\data\nepal-locations.geojson
- ..\..\data\nepal-opendengue.csv
Load DHIS2 district locations¶
locations = gpd.read_file(LOCATIONS_GEOJSON)
# DHIS2 UID
uid_col = "id" if "id" in locations.columns else None
if uid_col is None:
raise KeyError(f"Expected DHIS2 UID in GeoJSON 'id'. Found: {list(locations.columns)}")
locations["location"] = locations[uid_col].astype(str).str.strip()
# Join helper (district name)
if "name" not in locations.columns:
raise KeyError(f"Expected district name in GeoJSON 'name'. Found: {list(locations.columns)}")
locations["district_name"] = (
locations["name"].astype(str)
.str.replace(r"^\s*\d+\s+", "", regex=True) # drop "101 that came with location names"
.str.upper()
.str.strip()
)
# Keep only what we need
locations = locations[["location", "district_name", "geometry"]].dropna(subset=["location"]).copy()
locations
Load OpenDengue¶
df_raw = pd.read_csv(OPENDENGUE_SOURCE_PATH)
print("Loaded:", OPENDENGUE_SOURCE_PATH)
print("Columns:", df_raw.columns.tolist())
df_raw.head()
Loaded: ..\..\data\nepal-opendengue.csv
Columns: ['adm_0_name', 'adm_1_name', 'adm_2_name', 'full_name', 'ISO_A0', 'FAO_GAUL_code', 'RNE_iso_code', 'IBGE_code', 'calendar_start_date', 'calendar_end_date', 'Year', 'dengue_total', 'case_definition_standardised', 'S_res', 'T_res', 'UUID', 'region']
OpenDengue ships multiple administrative levels in the same file, so we keep only the Admin2 (district-level) rows.
df_raw = df_raw[df_raw['S_res']=='Admin2']
print('Number of rows after filtering to admin2 units:', len(df_raw))
Number of rows after filtering to admin2 units: 2772
Column mapping¶
# OpenDengue export columns
DATE_COL = "calendar_start_date"
CASES_COL = "dengue_total"
ADMIN2_COL = "adm_2_name"
missing = [c for c in [DATE_COL, CASES_COL, ADMIN2_COL] if c not in df_raw.columns]
if missing:
    raise KeyError(
        f"Input CSV is missing required columns: {missing}. "
        f"Available columns: {df_raw.columns.tolist()}"
    )
print("Using columns:", {"date": DATE_COL, "cases": CASES_COL, "admin2": ADMIN2_COL})Using columns: {'date': 'calendar_start_date', 'cases': 'dengue_total', 'admin2': 'adm_2_name'}
Normalize OpenDengue (Nepal districts)¶
df_norm = pd.DataFrame({
    "date": pd.to_datetime(df_raw[DATE_COL], errors="coerce"),
    "cases": pd.to_numeric(df_raw[CASES_COL], errors="coerce"),
    "district_name": df_raw[ADMIN2_COL],  # district name, not a DHIS2 location UID yet
})
# Normalize district name for the crosswalk join
df_norm["district_name"] = (
df_norm["district_name"]
.astype(str)
.str.upper()
.str.strip()
.str.replace(r"\s+", " ", regex=True)
)
# Map district_name -> DHIS2 orgUnit UID
df_norm = df_norm.merge(
    locations[["district_name", "location"]],
    on="district_name",
    how="left",
)
# Fail fast (or drop) if mapping is incomplete
unmapped = df_norm["location"].isna().mean()
print(f"Unmapped dengue rows: {unmapped:.2%}")
if unmapped > 0:
print("Examples:", df_norm.loc[df_norm["location"].isna(), "district_name"].drop_duplicates().head(15).tolist())
# Drop rows with missing values
df_norm = df_norm.dropna(subset=["location"]).copy()
df_norm = df_norm.dropna(subset=["date", "cases", "district_name"])
df_norm
Unmapped dengue rows: 1.30%
Examples: ['CHITAWAN']
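If the unmapped names are spelling variants rather than genuinely missing districts (here 'CHITAWAN' in OpenDengue, which most likely corresponds to the GeoJSON spelling 'CHITWAN'), a small alias map applied to df_norm["district_name"] before the merge above recovers those rows. This is a minimal sketch; the alias spelling is an assumption and should be checked against locations["district_name"].
# Hypothetical alias map (OpenDengue spelling -> GeoJSON spelling).
# Apply it before the district_name -> location merge so the rows are matched;
# verify each entry against locations["district_name"] first.
DISTRICT_ALIASES = {"CHITAWAN": "CHITWAN"}
df_norm["district_name"] = df_norm["district_name"].replace(DISTRICT_ALIASES)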
Monthly aggregation¶
# Convert to month period label (YYYY-MM)
df_norm["time_period"] = df_norm["date"].dt.to_period("M").astype(str)
# Aggregate within month + location
disease = (
    df_norm.groupby(["time_period", "location"], as_index=False)["cases"]
    .sum()
    .rename(columns={"cases": "disease_cases"})
)
print("Aggregated rows:", len(disease))
disease
Aggregated rows: 2736
Filter to districts and align time axis¶
# Keep only locations present in the GeoJSON backbone
before = len(disease)
disease = disease.merge(locations[["location"]], on="location", how="inner")
after = len(disease)
print(f"Backbone filter kept {after}/{before} rows")
# Build full (time_period x location) grid and fill missing with 0
all_months = pd.period_range(disease["time_period"].min(), disease["time_period"].max(), freq="M").astype(str)
all_locations = locations["location"].sort_values().unique()
grid = pd.MultiIndex.from_product([all_months, all_locations], names=["time_period", "location"]).to_frame(index=False)
disease_full = grid.merge(disease, on=["time_period", "location"], how="left")
disease_full["disease_cases"] = disease_full["disease_cases"].fillna(0)
# Keep integer-looking values as ints where possible
disease_full["disease_cases"] = pd.to_numeric(disease_full["disease_cases"], errors="coerce").fillna(0).astype(int)
print("Final rows (complete grid):", len(disease_full))
disease_full
Backbone filter kept 2736/2736 rows
Final rows (complete grid): 2772
Write output CSV¶
disease_full.to_csv(OUT_CSV, index=False)
print("Wrote:", OUT_CSV)
OUT_CSV
Wrote: ..\..\data\nepal-dengue-monthly-admin.csv
WindowsPath('../../data/nepal-dengue-monthly-admin.csv')
Next steps¶
This guide stops after downloading OpenDengue data and producing a harmonized, DHIS2-ready dataset for Nepal.
To import the resulting data into DHIS2, see the section for importing data to DHIS2.
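Before importing, a quick check that the written file matches the expected layout can catch problems early. This is a minimal sketch reusing OUT_CSV and the column names produced above.
check = pd.read_csv(OUT_CSV)
# Expect exactly the three harmonized columns and non-negative case counts
assert list(check.columns) == ["time_period", "location", "disease_cases"]
assert check["disease_cases"].ge(0).all()
print("Rows ready for import:", len(check))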