23  Raster manipulation II

In this lesson we will learn how to select pixels in a raster by creating a mask from a polygon. The goal will be to calculate total number of meters of vessel tracks within the fishing districts of the Kodiak fishing registration area in Alaska.

23.1 Data

We will use three datsets about Alaska in this lesson.

Dataset 1

This is a raster dataset at 1km/pixel resolution showing shipping intensity in the Pacific Arctic region during August 2017. The value at each raster cell represents the total length, in meters, of all vessel tracks within each cell. The dataset is part of time series of shipping intensity monthly and the complete datset can be accessed at: https://arcticdata.io/catalog/view/doi:10.18739/A2SQ8QJ9S.

Dataset 2

Vector data showing statistical areas dividing waters of the State of Alaska and the adjacent Exclusive Economic Zone (EEZ) into small units for the purpose of reporting and analyzing fishery harvest. The dataset is archived at KNB and can be accessed at the following link: https://knb.ecoinformatics.org/view/doi:10.5063/F1QR4VJK.

Dataset 3

A polygon showing Alaska’s boundary extracted from the 2022 US Census TIGER shapefiles and with updated CRS.

23.2 Import data

Let’s start by importing the necessary libraries and functions:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches # for creating legends

import xarray as xr
import rioxarray as rioxr
import geopandas as gpd

from rasterio.features import rasterize # for rasterizing polygons

# -----------
# update pandas column display
pd.set_option('display.max_columns', None)
# open raster
raw_dist = rioxr.open_rasterio('https://arcticdata.io/metacat/d1/mn/v2/object/urn%3Auuid%3A0d223f34-77fc-4ebe-8a58-459b7e575668')
raw_dist
<xarray.DataArray (band: 1, y: 2308, x: 3087)>
[7124796 values with dtype=float32]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 -2.55e+06 -2.549e+06 ... 5.347e+05 5.357e+05
  * y            (y) float64 2.711e+06 2.71e+06 ... 4.053e+05 4.043e+05
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    STATISTICS_MAXIMUM:  2564975.75
    STATISTICS_MEAN:     nan
    STATISTICS_MINIMUM:  0
    STATISTICS_STDDEV:   nan
    _FillValue:          -3.4e+38
    scale_factor:        1.0
    add_offset:          0.0
# open fishing areas polygons
fishing_areas = gpd.read_file(os.path.join(os.getcwd(), 'data', 'Alaska_Commercial_Salmon_Boundaries.gpkg'))
fishing_areas.head()
OBJECTID GEOMETRY_START_DATE GEOMETRY_END_DATE STAT_AREA STAT_AREA_NAME FISHERY_GROUP_CODE GIS_SERIES_NAME GIS_SERIES_CODE REGION_CODE REGISTRATION_AREA_NAME REGISTRATION_AREA_CODE REGISTRATION_AREA_ID REGISTRATION_LOCATION_ABBR MANAGEMENT_AREA_NAME MANAGEMENT_AREA_CODE DISTRICT_NAME DISTRICT_CODE DISTRICT_ID SUBDISTRICT_NAME SUBDISTRICT_CODE SUBDISTRICT_ID SECTION_NAME SECTION_CODE SECTION_ID SUBSECTION_NAME SUBSECTION_CODE SUBSECTION_ID COAR_AREA_CODE CREATOR CREATE_DATE EDITOR EDIT_DATE COMMENTS STAT_AREA_VERSION_ID Shape_Length Shape_Area geometry
0 12 1975-01-01 00:00:00+00:00 NaT 33461 Tanana River mouth to Kantishna River B Salmon B 3 Yukon Area Y None None Upper Yukon YU 6 District Y-6 None 6-A Subdistrict 6-A None None NaN None None None YU Evelyn Russel 2006-03-26 00:00:00+00:00 Sabrina Larsen 2017-02-02 00:00:00+00:00 Yukon District, 6 Subdistrict and 6-A Section ... None 4.610183 0.381977 MULTIPOLYGON (((-151.32805 64.96913, -151.3150...
1 13 1975-01-01 00:00:00+00:00 NaT 33462 Kantishna River to Wood River B Salmon B 3 Yukon Area Y None None Upper Yukon YU 6 District Y-6 None 6-B Subdistrict 6-B None None NaN None None None YU Evelyn Russel 2006-03-26 00:00:00+00:00 Sabrina Larsen 2017-02-02 00:00:00+00:00 Yukon District, 6 Subdistrict and 6-B Section ... None 3.682421 0.321943 MULTIPOLYGON (((-149.96255 64.70518, -149.9666...
2 18 1978-01-01 00:00:00+00:00 NaT 33431 Toklik to Cottonwood Point B Salmon B 3 Yukon Area Y None None Lower Yukon YL 3 District Y-3 None None None None None NaN None None None YL Evelyn Russel 2006-03-26 00:00:00+00:00 Sabrina Larsen 2017-02-02 00:00:00+00:00 Yukon District and 3 Subdistrict until 1/1/1980 None 2.215641 0.198740 MULTIPOLYGON (((-161.39853 61.55463, -161.4171...
3 19 1980-01-01 00:00:00+00:00 NaT 33442 Right Bank, Bishop Rock to Illinois Creek B Salmon B 3 Yukon Area Y None None Upper Yukon YU 4 District Y-4 None 4-B Subdistrict 4-B None None None NaN None None None YU Evelyn Russel 2006-03-26 00:00:00+00:00 Sabrina Larsen 2017-02-02 00:00:00+00:00 None None 9.179852 0.382788 MULTIPOLYGON (((-153.15234 65.24944, -153.0761...
4 20 1980-01-01 00:00:00+00:00 NaT 33443 Left Bank, Cone Point to Illinois Creek B Salmon B 3 Yukon Area Y None None Upper Yukon YU 4 District Y-4 None 4-B Subdistrict 4-B None None None NaN None None None YU Evelyn Russel 2006-03-26 00:00:00+00:00 Sabrina Larsen 2017-02-02 00:00:00+00:00 None None 9.500826 0.378262 MULTIPOLYGON (((-152.99905 65.17027, -152.9897...
# open Alaska boundary
ak = gpd.read_file(os.path.join(os.getcwd(), 'data', 'alaska_perimeter','alaska_perimeter.shp'))
ak
REGION DIVISION STATEFP STATENS GEOID STUSPS NAME LSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry
0 4 9 02 01785533 02 AK Alaska 00 G4000 A 1478943541175 245377731557 +63.3473560 -152.8397334 MULTIPOLYGON (((-1728945.561 474182.534, -1728...

23.3 Fishing areas preparation

We need to do some processing of each dataset before combining them for analysis. Let’s start with the fishing areas.

23.3.1 Update column names and CRS

# make column names small caps
fishing_areas.columns = fishing_areas.columns.str.lower()
print(fishing_areas.columns, "\n")

# -----------------------
# check CRS
print(f"raw_dist: {raw_dist.rio.crs} \nak: {ak.crs} \nfishing_areas: {fishing_areas.crs}\n")
#print(raw_dist.rio.crs == ak.crs)

# transform fishing_areas CRS to epsg:3338 (AK CRS)
fishing_areas = fishing_areas.to_crs(ak.crs)
print('CRS match:', raw_dist.rio.crs == fishing_areas.crs)
Index(['objectid', 'geometry_start_date', 'geometry_end_date', 'stat_area',
       'stat_area_name', 'fishery_group_code', 'gis_series_name',
       'gis_series_code', 'region_code', 'registration_area_name',
       'registration_area_code', 'registration_area_id',
       'registration_location_abbr', 'management_area_name',
       'management_area_code', 'district_name', 'district_code', 'district_id',
       'subdistrict_name', 'subdistrict_code', 'subdistrict_id',
       'section_name', 'section_code', 'section_id', 'subsection_name',
       'subsection_code', 'subsection_id', 'coar_area_code', 'creator',
       'create_date', 'editor', 'edit_date', 'comments',
       'stat_area_version_id', 'shape_length', 'shape_area', 'geometry'],
      dtype='object') 

raw_dist: EPSG:3338 
ak: EPSG:3338 
fishing_areas: EPSG:4326
CRS match: True

23.3.2 Data selection

This data includes fine scale subdivisions of the fishing areas. We’ll be working with data from the Kodiak registration area only, so let’s select that.

# Registration areas:
print(fishing_areas.registration_area_name.unique())

# select Kodiak area
kodiak = fishing_areas[fishing_areas.registration_area_name == 'Kodiak Area' ]
print('# geometries in Kodiak area:', len(kodiak) )
['Yukon Area' 'Kuskokwim Area' 'Prince William Sound Area'
 'Bristol Bay Area' 'Chignik Area' 'Kodiak Area' 'Alaska Peninsula Area'
 'Cook Inlet Area' 'Norton Sound-Port Clarence Area'
 'Arctic-Kotzebue Area' 'Aleutian Islands Area' 'Atka-Amlia Islands Area'
 'Southeastern Alaska Area']
# geometries in Kodiak area: 118

23.3.3 Plot of Kodiak fisihing area

One way to add a custom legend is to create an Patch object for each geometry in our plot.

We can create a Patch with a custom color and legend via the mpatches.Patch() function:

great_patch =  mpatches.Patch( color = 'great_color',
                               label = 'great_label')

To add this patch to the legend we should first create an axis, for example by calling fig, ax = plt.subplots() at the beginning of our plot. Then we add great_patch to the legend:

ax.legend(handles = [great_patch])

Example:

# create AK plot with color patches in legend

fig, ax = plt.subplots()
# --------------------------------------------
ak.plot(ax=ax, color = 'pink')
ak_patch = mpatches.Patch(color='pink',
                          label='Alaska, US')

# --------------------------------------------
#kodiak.dissolve().plot(ax=ax, color = 'blue')
kodiak.plot(ax=ax, color = 'blue')
kodiak_patch = mpatches.Patch(color='blue',
                              label='Kodiak fishing area')

# ------------------------------------------
# create elegend
ax.legend(handles = [ak_patch, kodiak_patch], frameon=False, loc='upper right')

plt.show()

23.3.4 Dissolve & explode polygons

The Kodiak data is too granular for our purposes: we want to get statistics on distance travelled on each fishing district.

# check number of districts and rows
print(f"number of geometries: {len(kodiak)}")
print(f"number of districts: {len(kodiak.district_name.unique())}")
number of geometries: 118
number of districts: 7

We can aggregate the geometries based on the values of a column using the dissolve() method. In our case we will dissolve by district name:

# dissolve by district
# as_index=False indicates we want to keep the "groupby" column as a column (not index)
districts = kodiak[['geometry','district_name']].dissolve(by='district_name', as_index=False)
districts 
district_name geometry
0 Afognak District POLYGON ((83189.145 888000.815, 75313.997 8878...
1 Alitak Bay District POLYGON ((-20595.210 769560.395, -20547.170 76...
2 Eastside Kodiak District MULTIPOLYGON (((-100710.109 633391.483, -10080...
3 Mainland District MULTIPOLYGON (((-139689.424 797362.176, -14034...
4 Northeast Kodiak District POLYGON ((100175.379 849237.562, 100103.728 84...
5 Northwest Kodiak District POLYGON ((9534.303 838682.678, 9542.565 838668...
6 Southwest Kodiak District POLYGON ((-31133.469 786483.521, -31127.360 78...

Let’s take a look at the districts:

Legend location

We can control the legend location adding loc and bbox_to_anchor to the legend_kwds.

It can be useful to combine both of these to place the legend outside the graph:

  • loc indicates the corner of the legend box we want to locate, and
  • bbox_to_anchor is a tuple with coordinates indicating where to place the corner specified in loc. Values between 0 and 1 are within the axes (the plot).
districts.plot(column='district_name', 
               legend=True, 
               legend_kwds={'loc': "upper left", 'bbox_to_anchor': (1, 1)})
/Users/galaz-garcia/anaconda3/envs/mpc-env/lib/python3.11/site-packages/geopandas/plotting.py:732: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(values.dtype):
<Axes: >

Notice the geometries of some our districts are multipolygons. To simplify our analysis for the purpose of this lesson we will separate each multipolygon into individual polygons. We can do this using the explode() method for gpd.GeoDataFrames:

# explode polygons
# ignore_index=True resests the index so we don't get a multi-index
districts = districts.explode(ignore_index=True)
districts
district_name geometry
0 Afognak District POLYGON ((83189.145 888000.815, 75313.997 8878...
1 Alitak Bay District POLYGON ((-20595.210 769560.395, -20547.170 76...
2 Eastside Kodiak District POLYGON ((-100710.109 633391.483, -100806.993 ...
3 Eastside Kodiak District POLYGON ((-14519.794 714624.553, -14616.441 71...
4 Mainland District POLYGON ((-139689.424 797362.176, -140346.100 ...
5 Mainland District POLYGON ((-13390.443 922237.350, -13416.238 92...
6 Northeast Kodiak District POLYGON ((100175.379 849237.562, 100103.728 84...
7 Northwest Kodiak District POLYGON ((9534.303 838682.678, 9542.565 838668...
8 Southwest Kodiak District POLYGON ((-31133.469 786483.521, -31127.360 78...

23.4 Distance raster preparation

Now let’s move on to our raster.

23.4.1 Squeeze

First, we have an extra unnecessary dimension. Let’s get rid of it:

# get rid of band dimension
print(f"Before squeeze:\ndimensions {raw_dist.dims} \ncoords: {raw_dist.coords} \n")

raw_dist = raw_dist.squeeze().drop('band')
print(f"After squeeze:\ndimensions {raw_dist.dims}\ncoords: {raw_dist.coords}")
Before squeeze:
dimensions ('band', 'y', 'x') 
coords: Coordinates:
  * band         (band) int64 1
  * x            (x) float64 -2.55e+06 -2.549e+06 ... 5.347e+05 5.357e+05
  * y            (y) float64 2.711e+06 2.71e+06 ... 4.053e+05 4.043e+05
    spatial_ref  int64 0 

After squeeze:
dimensions ('y', 'x')
coords: Coordinates:
  * x            (x) float64 -2.55e+06 -2.549e+06 ... 5.347e+05 5.357e+05
  * y            (y) float64 2.711e+06 2.71e+06 ... 4.053e+05 4.043e+05
    spatial_ref  int64 0

23.4.2 Updating no-data value

Take a look at the data:

raw_dist.plot()
<matplotlib.collections.QuadMesh at 0x148d15350>

The previous plot should make us think there are outliers or no-data values affecting the plot range. Let’s investigate this:

# check min and max
print(f"min: {raw_dist.min().item():,}, max: {raw_dist.max().item():,}")

# get no-data value
nodata = raw_dist.rio.nodata
print('no-data value', nodata)
min: -3.3999999521443642e+38, max: 2,564,975.75
no-data value -3.4e+38

We can select all pixels where there is raster data using the where() method. By default, where() will place np.nan (NAs) at every cell that does not satisty the condition.

# select pixels with data using where()
raw_dist.where(raw_dist != nodata)

# transform no data values into nan (float)
dist = raw_dist.where(raw_dist != nodata)

# check updates
print(f" min: {dist.min().item()}, max: {dist.max().item():,}")
dist.plot()
 min: 0.0, max: 2,564,975.75
<matplotlib.collections.QuadMesh at 0x148ee3050>

23.4.3 Clipping

Since we are only focusing on the Kodiak district, let’s clip the raster to this region:

# clip raster to Kodiak bounding box
kodiak_dist = dist.rio.clip_box(*districts.total_bounds)

# robust = True plots data within the 2th and 98th percentiles.
kodiak_dist.plot(robust=True)
<matplotlib.collections.QuadMesh at 0x14935b610>

Remember our goal is to calculate the total distance travelled in each district. We can take a look at the districts over the distance raster:

# raster + districts plot
fig, ax = plt.subplots()
kodiak_dist.plot(ax=ax, robust=True)
districts.plot(ax=ax, edgecolor='red', color='none')
plt.show()

23.5 Distance in a single district

A mask is a dataset that indicates which locations in the raster we will keep and which will be converted to NAs or, in our case, zeros. It is common to transform a polygon into a raster mask that selects all pixels that touch the polygon.

To transform a geometry into a mask we can use the rasterize() function from rasterio.features:

rasterio is another popular Python library to work with rasters. Although many of its functions have been streamlined in rioxarray, there are still some that have not been replaced.

In the next cell we use the polygon for the Alitak district districts.loc[1,'geometry'] to create a mask for the raster.

# create raster mask based on Alitak polygon
alitak_mask = rasterize(shapes = [districts.loc[1,'geometry']],  # a list of geometries
                        out_shape = kodiak_dist.shape,           # shape of outout np array (shape of raster)
                        transform = kodiak_dist.rio.transform(), # transformation of raster
                        all_touched = True)  # all pixels touched by geometries will be burned in 
print(type(alitak_mask))

# print some info about mask
print('mask shape: ', alitak_mask.shape)
print('same shape as raster? ', alitak_mask.shape == kodiak_dist.shape)
print('unique values in mask', np.unique(alitak_mask))
plt.imshow(alitak_mask)
<class 'numpy.ndarray'>
mask shape:  (352, 289)
same shape as raster?  True
unique values in mask [0 1]
<matplotlib.image.AxesImage at 0x148f7aed0>

To select pixels from the raster using the mask we can simply multiply both rasters together.

# apply mask to kodiak distances raster
dist_alitak = kodiak_dist * alitak_mask
dist_alitak.plot(robust=True)
<matplotlib.collections.QuadMesh at 0x149244690>

Finally, we can add up all the pixel values to obtain the total distance travelled by shipping vessels in Alitak:

print(f"total distance in Alitak: {dist_alitak.sum().item():,}")
total distance in Alitak: 5,606,546.5

Doing the previous process in a single cell:

# mask raster and calculate sum of pixels
alitak_mask = rasterize([districts.loc[1,'geometry']],
                                    out_shape = kodiak_dist.shape,
                                    transform = kodiak_dist.rio.transform(),
                                    all_touched = True)
dist_in_alitak = (kodiak_dist * alitak_mask).sum().item()
dist_in_alitak
5606546.5

23.6 Distance in all districts

Once we have achieved a good workflow for a single district, it is easy to repeat it using a for loop:

# calculate distance per district polygon
distances = []
for i in districts.index:
    mask = rasterize([districts.loc[i,'geometry']],
                              out_shape=kodiak_dist.shape,
                              transform=kodiak_dist.rio.transform(),
                              all_touched=True)
    dist_in_district = (kodiak_dist * mask).sum().item()
    distances.append(dist_in_district)
distances
[9804045.0,
 5606546.5,
 102465.03125,
 22725528.0,
 4788347.0,
 673996.375,
 20008534.0,
 86952560.0,
 6666639.0]

Then we can add the distances as a column in our GeoDataFrame:

# add column with distance per district polygon
districts['total_distance'] = distances
districts
district_name geometry total_distance
0 Afognak District POLYGON ((83189.145 888000.815, 75313.997 8878... 9.804045e+06
1 Alitak Bay District POLYGON ((-20595.210 769560.395, -20547.170 76... 5.606546e+06
2 Eastside Kodiak District POLYGON ((-100710.109 633391.483, -100806.993 ... 1.024650e+05
3 Eastside Kodiak District POLYGON ((-14519.794 714624.553, -14616.441 71... 2.272553e+07
4 Mainland District POLYGON ((-139689.424 797362.176, -140346.100 ... 4.788347e+06
5 Mainland District POLYGON ((-13390.443 922237.350, -13416.238 92... 6.739964e+05
6 Northeast Kodiak District POLYGON ((100175.379 849237.562, 100103.728 84... 2.000853e+07
7 Northwest Kodiak District POLYGON ((9534.303 838682.678, 9542.565 838668... 8.695256e+07
8 Southwest Kodiak District POLYGON ((-31133.469 786483.521, -31127.360 78... 6.666639e+06

23.7 Acknowledgments

This lesson was adapted from the Spatial and Image Data Using GeoPandas lesson from the NCEAS Scalable and Computationally Reproducible Approaches to Arctic Research workshop.