
Population-weighted aggregation

When aggregating gridded climate data to administrative regions, a simple spatial average can be misleading if population is unevenly distributed across the region. Population weighting provides a more accurate measure of the conditions experienced by people, which is particularly important for health-related analyses.

Why use population weighting?

Consider a region where most people live in urban areas concentrated in one part of the territory, while the rest is sparsely populated. An unweighted spatial average would give equal importance to all grid cells, regardless of whether people actually live there. This means the resulting average may not accurately represent the conditions experienced by the typical person in that region.

Population weighting solves this problem by giving more weight to grid cells where more people live. This ensures that the aggregated climate values better reflect the actual exposure of the population.

Population-weighted aggregation is especially valuable when:

  • Analyzing health impacts of climate and environmental conditions

  • Population is concentrated in specific areas (cities, valleys, coastal zones)

  • The measured condition varies significantly across the region

  • Results will inform public health decisions or interventions

The population weighting method

The basic formula for population-weighted aggregation is:

$$\text{Weighted Average} = \frac{\sum (\text{Variable} \times \text{Population})}{\sum \text{Population}}$$

In practice, this means:

  1. Align climate and population data to the same grid

  2. Multiply each grid cell's variable value by the corresponding population count

  3. Sum these weighted values across the region

  4. Divide by the total population of the region
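The four steps above can be sketched with plain numpy on toy data (the values here are made up purely for illustration). Note how the weighted average is pulled toward the value in the cell where most people live:

```python
import numpy as np

# Toy example: a variable (e.g. PM2.5) in two grid cells,
# with three quarters of the population in the second cell.
variable = np.array([10.0, 20.0])     # value per grid cell
population = np.array([1.0, 3.0])     # population per grid cell

# Steps 2-4: multiply, sum, divide by total population.
weighted_avg = (variable * population).sum() / population.sum()
unweighted_avg = variable.mean()

print(weighted_avg)    # 17.5 - closer to the populous cell's value
print(unweighted_avg)  # 15.0 - treats both cells equally
```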

In this tutorial, we’ll demonstrate this workflow using PM2.5 air quality data and population data for two districts in Sri Lanka.

Import required libraries

We’ll use several Python libraries for this analysis:

  • xarray: For working with multi-dimensional climate data

  • geopandas: For handling geographic boundaries

  • earthkit.transforms: For spatial aggregation functions

  • numpy: For numerical operations

import xarray as xr
import geopandas as gpd
import numpy as np
from earthkit.transforms import spatial

Load PM2.5 air quality data

We start by loading daily PM2.5 (fine particulate matter) air quality data for Sri Lanka from December 2022 through January 2023. To keep the dataset lightweight, it has been spatially restricted to the Colombo and Kalutara districts rather than the full country. Please note that the data is only included to demonstrate the population weighting method, and should not be used to analyse PM2.5 levels.

This NetCDF file contains gridded PM2.5 concentration values across space and time:

file = "../data/pm25-sri-lanka-colombo-kalutara-dec-2022-jan-2023.nc"
pm_data = xr.open_dataset(file)
pm_data

Load district boundaries

Next, we load the geographic boundaries for the two districts covered by the PM2.5 dataset, Colombo and Kalutara:

districts_file = "../data/sri-lanka-districts-colombo-kalutara.geojson"
districts = gpd.read_file(districts_file)
districts

The districts were downloaded from Humanitarian Data Exchange (HDX).

Select data for one year

For this example, we’ll focus on the PM2.5 data from 2023 (in this dataset, that means January 2023). We can easily filter the dataset by year using xarray’s datetime accessor:

pm_data_2023 = pm_data.sel(time=pm_data['time'].dt.year == 2023)
pm_data_2023

Load population data

Now we load the population raster for Sri Lanka. This file contains population counts per grid cell at approximately 1 km resolution. Note that we rename the coordinate dimensions to match our PM2.5 data (lon and lat):

pop_file = "../data/lka_pop_2023_CN_1km_R2025A_UA_v1.tif"
pop_ds = xr.open_dataset(pop_file)
pop_data = pop_ds['band_data'].rename({"x": "lon", "y": "lat"})
pop_data

Align population data to climate grid

Critical step: The population and PM2.5 data have different grid alignments. We need to resample the population data to match the PM2.5 grid exactly. We use nearest-neighbor interpolation so that each cell keeps an actual population count rather than a smoothed, blended value:

pop_aligned = pop_data.interp(lon=pm_data_2023.lon, lat=pm_data_2023.lat, method="nearest")
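To make the nearest-neighbor step concrete, here is a minimal sketch with numpy only, on hypothetical 1-D longitude axes (a coarse climate grid and a finer population grid). For each climate-grid point we pick the population value at the closest population-grid point, which is essentially what `interp(..., method="nearest")` does along each dimension:

```python
import numpy as np

# Hypothetical toy grids (illustration only, not the tutorial's data).
clim_lon = np.array([0.0, 1.0, 2.0])                 # climate grid
pop_lon = np.array([0.0, 0.4, 0.9, 1.5, 2.1])        # population grid
pop_vals = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # people per cell

# For each climate point, index of the nearest population point.
idx = np.abs(pop_lon[None, :] - clim_lon[:, None]).argmin(axis=1)
pop_on_clim_grid = pop_vals[idx]

print(pop_on_clim_grid)  # [10. 30. 50.]
```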

Calculate weighted PM2.5 values

Now we multiply each PM2.5 value by the corresponding population count. This gives us the “population-weighted PM2.5” for each grid cell:

pm_weighted = pm_data_2023 * pop_aligned

Aggregate to districts

Finally, we complete the population-weighted aggregation:

  1. Sum weighted PM2.5 (numerator): Total of all PM2.5 × population values within each district

  2. Sum population (denominator): Total population within each district

  3. Divide: This gives us the population-weighted average PM2.5 for each district

We use earthkit’s spatial.reduce function to perform spatial aggregation for each district polygon:

agg_num = spatial.reduce(pm_weighted, districts, how="sum", mask_dim="code")
agg_den = spatial.reduce(pop_aligned, districts, how="sum", mask_dim="code")

pw = agg_num / agg_den

pw = pw.rename({"pm25": "pm25_popweighted"})

agg_df = pw.to_dataframe().reset_index().drop(columns=['band', 'spatial_ref'])
agg_df

The resulting dataframe shows population-weighted PM2.5 values for each district and time period. These values represent the average PM2.5 exposure experienced by people in each district, accounting for where people actually live.
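The district aggregation above can be mimicked with numpy boolean masks, which may help build intuition for what `spatial.reduce` is doing per polygon. The grid, populations, and masks below are invented for illustration; each district's value is the masked sum of variable × population divided by the masked population sum:

```python
import numpy as np

# Hypothetical 2x2 grid split between two districts by boolean masks.
pm25 = np.array([[10.0, 20.0], [30.0, 40.0]])
pop = np.array([[100.0, 1.0], [1.0, 100.0]])
district_a = np.array([[True, True], [False, False]])  # top row
district_b = ~district_a                               # bottom row

def pop_weighted(var, pop, mask):
    # Numerator: sum of var * pop inside the district;
    # denominator: total population inside the district.
    return (var * pop)[mask].sum() / pop[mask].sum()

pw_a = pop_weighted(pm25, pop, district_a)  # 1020 / 101, about 10.1
pw_b = pop_weighted(pm25, pop, district_b)  # 4030 / 101, about 39.9
```

Each district's result sits close to the value in its heavily populated cell, exactly the behaviour population weighting is designed to produce.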

Processing multiple years

If you have data spanning multiple years, you’ll want to use the appropriate population dataset for each year, since population distributions change over time. Here’s an example that loops through multiple years:

years = np.unique(pm_data['time'].dt.year).astype(str)

for year in years:
    print(f"Calculating {year}")
    pm_year = pm_data.sel(time=year)

    pop_file = f"../data/lka_pop_{year}_CN_1km_R2025A_UA_v1.tif"
    pop_ds = xr.open_dataset(pop_file)
    pop_data = pop_ds['band_data'].rename({"x": "lon", "y": "lat"})
    pop_aligned = pop_data.interp(lon=pm_year.lon, lat=pm_year.lat, method="nearest")

    pm_weighted = pm_year * pop_aligned

    agg_num = spatial.reduce(pm_weighted, districts, how="sum", mask_dim="code")
    agg_den = spatial.reduce(pop_aligned, districts, how="sum", mask_dim="code")

    pw = agg_num / agg_den
    pw = pw.rename({"pm25": "pm25_popweighted"})

    agg_df = pw.to_dataframe().reset_index().drop(columns=['band', 'spatial_ref'])
    print(agg_df)
Calculating 2022
         time  code  pm25_popweighted
0  2022-12-01  LK11         32.523187
1  2022-12-01  LK13         31.651156
2  2022-12-02  LK11         36.380375
3  2022-12-02  LK13         32.709560
4  2022-12-03  LK11         26.958593
..        ...   ...               ...
57 2022-12-29  LK13         36.403322
58 2022-12-30  LK11         33.888856
59 2022-12-30  LK13         37.663193
60 2022-12-31  LK11         29.629065
61 2022-12-31  LK13         28.865093

[62 rows x 3 columns]
Calculating 2023
         time  code  pm25_popweighted
0  2023-01-01  LK11         43.924891
1  2023-01-01  LK13         38.645365
2  2023-01-02  LK11         45.822828
3  2023-01-02  LK13         35.279168
4  2023-01-03  LK11         33.905775
..        ...   ...               ...
57 2023-01-29  LK13         57.848309
58 2023-01-30  LK11         47.405954
59 2023-01-30  LK13         39.707348
60 2023-01-31  LK11         41.998945
61 2023-01-31  LK13         36.978532

[62 rows x 3 columns]
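As written, the loop only prints each year's dataframe. If you want a single table spanning all years, one option is to collect the per-year results and concatenate them with pandas (a sketch with stand-in dataframes; `frames` and the toy values are hypothetical, not part of the tutorial's data):

```python
import pandas as pd

# Stand-ins for the per-year agg_df dataframes produced inside the loop.
frames = []
for year in ["2022", "2023"]:
    agg_df = pd.DataFrame({
        "time": [f"{year}-01-01"],
        "code": ["LK11"],
        "pm25_popweighted": [30.0],
    })
    frames.append(agg_df)  # in the real loop: frames.append(agg_df)

# One dataframe covering every processed year.
all_years = pd.concat(frames, ignore_index=True)
```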