Part 1: GeoFLOW - Geospatial workflow using Python & Jupyter Notebook
Geospatial data analysis is more visual than just numbers. As a data scientist, you will want to look at the intermediate steps in the processing pipeline. This has a two-fold benefit: first, it builds better intuition about your data & second, it makes bug tracking easier. Let us try to understand this better by solving a mystery: where do people tend to live?
Where do people live?
Population data is available at a low spatial & temporal resolution. In most countries around the world, a census is conducted only once every 10 years & the results are available at the ward level at best.
Governments, NGOs & other public bodies design many policies & measures based on this outdated dataset. In the event of an epidemic or natural disaster, it would help a lot to know exactly where people live & to track their movements.
To address this issue, we will try to estimate the human population at the 100m x 100m GRID level. Using building footprints as a proxy for human settlement, we will see how a global data source like the Gridded Population of the World (GPW) can be used to estimate the human population at such a granular resolution.
Details of the city under consideration
We are looking at the city of Fort Portal, which is regarded as the tourism city of Uganda.
import geopandas as gpd, numpy as np, pandas as pd, rasterio as rio
from pathlib import Path
from rasterio.mask import mask
from shapely.geometry import Polygon

REGION = 'fortportal'
UTM = 32636                   # projected CRS for Fort Portal (UTM zone 36N)
WGS84, MERCATOR = 4326, 3857  # geographic & Web Mercator CRS codes used below
INPUT, INTER, OUTPUT = Path('input'), Path('intermediate'), Path('output')  # assumed pipeline directory layout
region = gpd.read_file(INPUT/f'{REGION}.geojson').to_crs(epsg=WGS84)
box = region.total_bounds  # (minx, miny, maxx, maxy)
show(region)
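`show` is a small plotting helper used throughout this walkthrough but not defined in the snippets; a minimal sketch of what it might look like, assuming matplotlib (purely illustrative, not the original implementation):

import matplotlib.pyplot as plt

def show(vector, layer=None):
    # Hypothetical helper: plot the region outline and, optionally,
    # a clipped raster array or a vector layer on top of it
    fig, ax = plt.subplots(figsize=(8, 8))
    if isinstance(layer, np.ndarray):
        ax.imshow(layer)              # raster shown in pixel space
    elif layer is not None:
        layer.plot(ax=ax, alpha=0.6)  # e.g. the GRID GeoDataFrame
        vector.boundary.plot(ax=ax, color='red')
    else:
        vector.plot(ax=ax, color='lightgrey', edgecolor='red')
    plt.show()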
Load Population data
- Read the Gridded Population of the World (GPW) dataset, which is available as a Raster TIFF file
- Clip it to the city boundary using the GeoJSON file
- Save the clipped raster
# Open the global GPW raster & read the band inside the region's bounding box
gpw = rio.open(INPUT/'gpw_world.tif')
gpw_region = gpw.read(1, window=gpw.window(*box))
# Clip to the region boundary
region_mask, region_mask_tfm = mask(dataset=gpw, shapes=region.geometry, all_touched=True, crop=True, filled=True)
region_mask = np.where(region_mask < 0, 0, region_mask).squeeze()  # nodata values are negative; zero them out
# Save the clipped raster
region_meta = gpw.meta
region_meta.update(dict(
    driver='GTiff',
    height=region_mask.shape[0],
    width=region_mask.shape[1],
    transform=region_mask_tfm
))
with rio.open(INTER/f'{REGION}_gpw_output.tif', 'w', **region_meta) as f:
    f.write(region_mask, indexes=1)
show(region, region_mask)
!rio shapes {INTER/REGION}_gpw_output.tif --bidx 1 --precision 6 > {INTER/REGION}_gpw_output.geojson
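The `rio shapes` CLI call above vectorizes the clipped raster into a GeoJSON of population TILEs. If you prefer to stay inside Python, `rasterio.features.shapes` does the same job; a rough equivalent sketch (the `val` property name mirrors what the CLI emits, and the `val > 0` filter is an optional extra):

from rasterio.features import shapes

with rio.open(INTER/f'{REGION}_gpw_output.tif') as src:
    band = src.read(1)
    feats = [{'geometry': geom, 'properties': {'val': float(val)}}
             for geom, val in shapes(band, transform=src.transform)
             if val > 0]  # keep only populated tiles
    tiles = gpd.GeoDataFrame.from_features(feats, crs=src.crs)
tiles.to_file(INTER/f'{REGION}_gpw_output.geojson', driver='GeoJSON')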
Gridify
The vectorized GPW GeoJSON file gives us the population for the city at 1km x 1km spatial resolution.
The intent is to estimate the population at 100m x 100m spatial resolution. To achieve this, we have to:
- Gridify: Split the 1km x 1km TILE into smaller grids of size 100m x 100m
- Each 1km x 1km TILE will give us 100 GRIDs of size 100m x 100m
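The gridify step below operates on a `tiles_gdf` GeoDataFrame whose construction is not shown in the original notebook. A minimal sketch of how it might be built from the vectorized GeoJSON (column names are inferred from their usage in `gridify`; the 1000m divisor assumes tile extents are roughly metric in EPSG:3857, which holds near the equator):

# Hypothetical reconstruction: load the vectorized TILEs & derive the columns
# that gridify expects (tile_idx, tile_population, tile_width, tile_height)
tiles_gdf = (gpd.read_file(INTER/f'{REGION}_gpw_output.geojson')
               .to_crs(epsg=MERCATOR)
               .rename(columns={'val': 'tile_population'})
               .reset_index()
               .rename(columns={'index': 'tile_idx'}))
# Tile dimensions in km: a standard GPW tile is ~1km x 1km, so both are usually 1
bounds = tiles_gdf.geometry.bounds
tiles_gdf['tile_width']  = ((bounds.maxx - bounds.minx) / 1000).round().astype(int)
tiles_gdf['tile_height'] = ((bounds.maxy - bounds.miny) / 1000).round().astype(int)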
def gridify(tile):
    # Split one TILE into 100m x 100m GRID polygons
    polygons = []
    xmin, ymin, xmax, ymax = tile.geometry.bounds
    width = tile.tile_width
    height = tile.tile_height
    stepx = +(xmax - xmin)/(10 * width )  # 10 grids per km along x
    stepy = -(ymax - ymin)/(10 * height)  # negative: we sweep from ymax down to ymin
    for x in np.arange(xmin, xmax, stepx):
        for y in np.arange(ymax, ymin, stepy):
            poly = [
                (x        , y        ),
                (x + stepx, y        ),
                (x + stepx, y + stepy),
                (x        , y + stepy)
            ]
            polygons.append(Polygon(poly))
    # Carry the parent TILE's attributes onto every GRID
    d = {
        'geometry': polygons,
        'tile_idx': tile.tile_idx,
        'tile_population': tile.tile_population,
        'tile_width': tile.tile_width,
        'tile_height': tile.tile_height
    }
    grids_gdf = gpd.GeoDataFrame(d, crs=f'EPSG:{MERCATOR}')
    # Clip the GRIDs to the TILE outline to drop any floating-point overflow
    tile_gdf = gpd.GeoDataFrame(tile.to_frame().T, crs=f'EPSG:{MERCATOR}')
    grids_gdf = gpd.clip(grids_gdf, tile_gdf)
    return grids_gdf
# For each TILE create the GRIDs
grids_gdf = tiles_gdf.apply(gridify, axis=1)
grids_gdf = pd.concat(grids_gdf.to_list())
grids_gdf = (grids_gdf.reset_index(drop=True)
                      .reset_index()
                      .rename(columns={'index': 'idx'}))
# Change the CRS of TILEs & GRIDs back to the region's UTM coordinates
tiles_gdf = tiles_gdf.to_crs(epsg=UTM)
grids_gdf = grids_gdf.to_crs(epsg=UTM)
region = region.to_crs(epsg=UTM)
grids_gdf.head()
Load building footprints dataset
We are using buildings as a proxy for estimating the human population.
Microsoft has been kind enough to release building footprints for multiple regions around the globe; please check them out HERE. Fortunately, the city we are considering here (Fort Portal in Uganda) is also covered.
If you are considering a region for which there are no readily available building footprints, here are a few alternatives:
- Check OpenStreetMap (OSM); it has good coverage of a lot of places around the globe (see the sketch after this list). Consider contributing to it as well
- Put your deep learning skills into practice: get hold of some satellite imagery of the region & generate building footprints using segmentation models. Dave Luo has a nice blog post about this LINK.
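As a hedged illustration of the OSM route, the osmnx package can download building footprints for a named place. The function is `features_from_place` in recent osmnx releases (`geometries_from_place` in older ones); treat this as a sketch, not the post's method:

import osmnx as ox

# Pull OSM building footprints for the region (recent osmnx API)
osm_footprints = ox.features_from_place('Fort Portal, Uganda', tags={'building': True})
# Keep polygonal geometries only & project to the region's UTM CRS
osm_footprints = (osm_footprints[osm_footprints.geometry.geom_type.isin(['Polygon', 'MultiPolygon'])]
                    .to_crs(epsg=UTM))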
footprints_gdf = (gpd.read_file(f'{INPUT/REGION}_footprints.geojson')
                     .to_crs(epsg=UTM)
                     .assign(centroid=lambda x: x.centroid,
                             building_area=lambda x: x.geometry.area)
                     # Rename the polygon geometry to 'building_count' so the
                     # groupby-agg later can simply count the non-null entries
                     .rename(columns={'geometry': 'building_count'})
                     .set_geometry('centroid'))
footprints_gdf.head()
Compute building statistics (# of buildings, area occupied by buildings) for each GRID
Now, we have:
- GRIDs at 100m x 100m resolution
- Building footprints for the city
We would like to find the buildings falling within each GRID & then compute their count & area occupancy. We can do a spatial join between the two DataFrames, which is much faster since geopandas 0.8. We take the centroid of each building footprint & spatially join it with the GRID boundaries.
def sjoin_polygon_footprints(poly_gdf, footprints_gdf, idx, name, agg):
    # Attach each footprint centroid to the polygon it falls in,
    # then aggregate the footprint stats per polygon
    poly_gdf = gpd.sjoin(poly_gdf, footprints_gdf, how='left', op='intersects')
    poly_gdf = (poly_gdf.groupby(idx)
                        .agg(agg)
                        .reset_index())
    poly_gdf = (gpd.GeoDataFrame(poly_gdf, crs=f'EPSG:{UTM}')
                   .rename(columns={'building_area' : f'{name}_building_area',
                                    'building_count': f'{name}_building_count'}))
    return poly_gdf
agg = {
    'geometry'       : 'first',
    'building_area'  : 'sum',
    'building_count' : 'count',
    'tile_idx'       : 'first',
    'tile_population': 'first',
    'tile_width'     : 'first',
    'tile_height'    : 'first'
}
grids_gdf = sjoin_polygon_footprints(grids_gdf, footprints_gdf, 'idx', 'grid', agg=agg)
grids_gdf.head()
Remove extra GRIDs that fall beyond the region boundary & adjust the population accordingly
The GPW population data is available at 1km x 1km resolution, and we clipped it to the region boundary. When we split a TILE into GRIDs, some GRIDs along the edges will fall beyond the region boundary. Since we have no building footprints outside the region boundary, we remove those extra GRIDs & scale the population by the fraction of GRIDs that fall within the boundary. For example, if only 80 of a TILE's 100 GRIDs survive the clipping, its population is scaled by 80/100.
grids_gdf = gpd.sjoin(grids_gdf, region[['geometry']], how='inner', op='intersects').drop(labels='index_right', axis=1)
# Fix the index of the grids
grids_gdf = (grids_gdf.drop(labels='idx', axis=1)
                      .reset_index(drop=True).reset_index()
                      .rename(columns={'index': 'idx'}))
# Adjust the population accordingly
def recompute_population_by_grids(gp):
    # TILEs with at most one building are treated as unpopulated
    if gp.grid_building_count.sum() <= 1: gp['tile_population'] = 0
    # Scale by the fraction of the TILE's grids that survived the clipping
    return gp.tile_population * (gp.shape[0] / (gp.tile_width * gp.tile_height * 100))

# Note: .values relies on grids_gdf being ordered by tile_idx, which holds
# because the GRIDs were created tile by tile
grids_gdf['tile_population'] = (grids_gdf.groupby('tile_idx')
                                         .apply(recompute_population_by_grids)
                                         .values)
show(region, grids_gdf)
Compute the Population at 100m x 100m GRID level
For each GRID, we have the TILE population, the # of buildings inside its boundary & the area occupied by buildings inside its boundary.
We can statistically distribute the TILE population to the GRIDs by considering the building properties inside them. As discussed, buildings are a good proxy for the human population.
# Tile population distributed by the ratio of # of buildings in each grid
grid_population_count = tile_population * (grid_building_count / tile_building_count)
# Tile population distributed by the ratio of area of buildings in each grid
grid_population_area = tile_population * (grid_building_area / tile_building_area)
# Equal-weighted average of the two estimates
grid_population = 0.5 * grid_population_count + 0.5 * grid_population_area
grids_gdf[['tile_building_count', 'tile_building_area']] = grids_gdf.groupby('tile_idx')[['grid_building_count', 'grid_building_area']].transform('sum')
# Statistically distribute the TILE population to GRIDs
grids_gdf['grid_population'] = (0.5 * grids_gdf['tile_population'] * (grids_gdf['grid_building_count'] / grids_gdf['tile_building_count']) +
                                0.5 * grids_gdf['tile_population'] * (grids_gdf['grid_building_area' ] / grids_gdf['tile_building_area' ]))
# GRIDs in TILEs with no buildings end up NaN; treat them as unpopulated
grids_gdf.loc[:, 'grid_population'] = grids_gdf['grid_population'].fillna(0)
grids_gdf.head()
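A quick, optional sanity check (not in the original notebook): the GRID populations within each TILE should sum back to that TILE's adjusted population, since the count & area ratios each sum to 1 per TILE:

# Hypothetical verification: per-tile grid totals vs. the adjusted tile population
check = grids_gdf.groupby('tile_idx').agg(
    grid_total=('grid_population', 'sum'),
    tile_total=('tile_population', 'first'),
)
assert np.allclose(check.grid_total, check.tile_total)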
grids_gdf.to_crs(epsg=WGS84).to_file(f'{OUTPUT/REGION}_grids_output_{WGS84}.geojson', driver='GeoJSON')
Hopefully, this gives you a fair idea of why & how to use notebooks to create geospatial workflows.
In the next post, we will see how to automate notebook execution using papermill.