import os
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
# this is our first time using this function
from shapely.geometry import box
16 Reprojecting & Clipping
In this lesson we will learn how to change the CRS of a geopandas.GeoDataFrame
and how to clip different geometries using a polygon in a geopandas.GeoDataFrame
. Through doing these operations we will create the following map of roads and populated places in Alaska:
16.1 Data
We will use three datasets in this lesson.
First dataset The first dataset is a TIGER shapefile from the United States Census Bureau.
We will use the shapefiles for the US states. Follow these steps to download shapefile with the United States’ states:
- At the bottom of the 2022 page, under Download, click on “Web Interface”
- For year, select 2022, and for layer type select “States (and equivalent)”. Click submit.
- Click on “Download national file”.
You can check the metadata for all the TIGER shapefiles here. The columns for this shapefile are:
Second dataset The second dataset we’ll use is Natural Earth’s simple medium scale populated places dataset. We can obtain this dataset by downloading the shapefile (choose the one that says “simple (less columns)”).
Third dataset The third dataset we’ll use is Natural Earth’s road dataset. We can obtain this dataset by downloading the shapefile
Move all datasets to a directory named “data” inside your working directory.
16.2 Import data
Let’s start by loading our libraries and then importing the datasets we’ll use.
# display all columns when looking at dataframes
"display.max.columns", None)
pd.set_option(
# ----- IMPORT DATA -----
# states from US Census TIGER files
= gpd.read_file(os.path.join('data','tl_2022_us_state','tl_2022_us_state.shp'))
states # make column names small caps
= states.columns.str.lower()
states.columns
# populated places from Natural Earth
= gpd.read_file(os.path.join('data','ne_50m_populated_places_simple','ne_50m_populated_places_simple.shp'))
places
# roads from Natural Earth
= gpd.read_file(os.path.join('data','ne_10m_roads','ne_10m_roads.shp')) roads
16.3 Prepare Alaska polygon
16.3.1 Exploration
Let’s start by taking taking a look at our stats geo-dataframe. Since this is a geospatial dataset, exploration should include at least checking the head of the dataset, plotting the data, and looking at its CRS.
# print the CRS
print(states.crs)
# look at first five columns
3) states.head(
EPSG:4269
region | division | statefp | statens | geoid | stusps | name | lsad | mtfcc | funcstat | aland | awater | intptlat | intptlon | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 5 | 54 | 01779805 | 54 | WV | West Virginia | 00 | G4000 | A | 62266456923 | 489045863 | +38.6472854 | -080.6183274 | POLYGON ((-77.75438 39.33346, -77.75422 39.333... |
1 | 3 | 5 | 12 | 00294478 | 12 | FL | Florida | 00 | G4000 | A | 138962819934 | 45971472526 | +28.3989775 | -082.5143005 | MULTIPOLYGON (((-83.10874 24.62949, -83.10711 ... |
2 | 2 | 3 | 17 | 01779784 | 17 | IL | Illinois | 00 | G4000 | A | 143778515726 | 6216539665 | +40.1028754 | -089.1526108 | POLYGON ((-87.89243 38.28285, -87.89334 38.282... |
states.plot()
<Axes: >
16.3.2 Selection
For this lesson, we are intersted in plotting data in Alaska. Let’s start by selecting this data:
# select Alaska from states
= states[states.name =='Alaska']
alaska alaska.plot()
<Axes: >
16.3.3 Reprojecting
As in our previous lesson, we bump into the issue of Alaska’s islands elongating the map. To fix this, we will reproject the Alaska geo-dataframe. Reprojecting means precisely this, changing the coordinate reference system of your geospatial data. In our case we will reproject the Alaska geo-dataframe to the CRS EPSG:3338. This CRS is a projected CRS, better suited for working with data from Alaska:
Changing CRSs in GeoPandas is very simple using the to_crs()
method for gpd.GeoDataFrame
s. The general syntax is:
= geodf.to_crs(new_crs) updated_geodf
where:
updated_geodf
is the output of the method, a new geodataframe (to_crs()
does not work in place),geodf
is thegpd.GeoDataFrame
we want to transform,new_crs
an object of type CRS or string representing the CRS (ex:'epsg:3338'
), the CRS we want to convert to.
In our case:
# change to projected CRS optimized for Alaska
= alaska.to_crs('epsg:3338')
alaska alaska.plot()
<Axes: >
# check new CRS
print('is this CRS projected? ', alaska.crs.is_projected)
alaska.crs
is this CRS projected? True
<Projected CRS: EPSG:3338>
Name: NAD83 / Alaska Albers
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: United States (USA) - Alaska.
- bounds: (172.42, 51.3, -129.99, 71.4)
Coordinate Operation:
- name: Alaska Albers (meters)
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
16.4 Prepare populated places
16.4.1 Exploration
Let’s now explore the populated places data.
# print the CRS
print(places.crs)
# look at first five columns
3) places.head(
EPSG:4326
scalerank | natscale | labelrank | featurecla | name | namepar | namealt | nameascii | adm0cap | capalt | capin | worldcity | megacity | sov0name | sov_a3 | adm0name | adm0_a3 | adm1name | iso_a2 | note | latitude | longitude | pop_max | pop_min | pop_other | rank_max | rank_min | meganame | ls_name | min_zoom | ne_id | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10 | 1 | 5 | Admin-1 region capital | Bombo | None | None | Bombo | 0 | 0 | None | 0 | 0 | Uganda | UGA | Uganda | UGA | Bamunanika | UG | None | 0.583299 | 32.533300 | 75000 | 21000 | 0.0 | 8 | 7 | None | None | 7.0 | 1159113923 | POINT (32.53330 0.58330) |
1 | 10 | 1 | 5 | Admin-1 region capital | Fort Portal | None | None | Fort Portal | 0 | 0 | None | 0 | 0 | Uganda | UGA | Uganda | UGA | Kabarole | UG | None | 0.671004 | 30.275002 | 42670 | 42670 | 0.0 | 7 | 7 | None | None | 7.0 | 1159113959 | POINT (30.27500 0.67100) |
2 | 10 | 1 | 3 | Admin-1 region capital | Potenza | None | None | Potenza | 0 | 0 | None | 0 | 0 | Italy | ITA | Italy | ITA | Basilicata | IT | None | 40.642002 | 15.798997 | 69060 | 69060 | 0.0 | 8 | 8 | None | None | 7.0 | 1159117259 | POINT (15.79900 40.64200) |
places.plot()
<Axes: >
This dataset has CRS EPSG:4326, this is the EPSG code for the WGS84 CRS. This is not surprise since this is a global dataset, and EPSG:4326/WGS84 is the most used CRS for such data.
Let’s see what happens when we try to plot this data on top of Alaska:
# Trouble
= plt.subplots()
fig, ax
=ax)
alaska.plot(ax=ax, color='red')
places.plot(ax
plt.show()
This is a classic mistake in analysis. To plot, analyze, or integrate different geospatial datasets they must have the same CRS.
Here, alaska
and places
have different CRSs, leading to unexpected results when plotting them together:
print(alaska.crs)
print(places.crs)
epsg:3338
EPSG:4326
16.4.2 Reprojecting
Reprojecting the places
geo-datafarme into alaska
’s CRS is simple using to_crs()
:
# update crs
= places.to_crs(alaska.crs)
places print(alaska.crs == places.crs)
True
Let’s check that map again:
= plt.subplots()
fig, ax
=ax)
alaska.plot(ax=ax, color='red', markersize=2 )
places.plot(ax
plt.show()
This is better: we can see there is the Alaska poygons and some points on top of it. Our next step is to select these points.
16.4.3 Clipping
Clipping means using a polygon (or polygons) to only select geospatial data within them. Clipping a gpd.GeoDataFrame
is easy using the geopandas clip()
function. The general syntax is:
= gpd.clip(geodf, mask) updated_geodf
where:
updated_geodf
is the output of the method: the intersection of the geometries ingeodf
withmask
,geodf
is thegpd.GeoDataFrame
we want to clip,mask
is agpd.GeoDataFrame
with the polygon(s) we want to use for clipping. Thismask
must be in the same CRS asgeodf
!
In our case:
# clip places to Alaska
= gpd.clip(places, alaska)
ak_places print('Number of places in AK:', len(ak_places))
Number of places in AK: 12
# plot populated places in Alaska
= plt.subplots()
fig, ax
=ax)
alaska.plot(ax=ax, color='red')
ak_places.plot(ax
plt.show()
16.5 Prepare roads
16.5.1 Exploration
Now we move on to our roads dataset.
# print the CRS
print(roads.crs)
# look at first five columns
3) roads.head(
EPSG:4326
scalerank | featurecla | type | sov_a3 | note | edited | name | namealt | namealtt | routeraw | question | length_km | toll | ne_part | label | label2 | local | localtype | localalt | labelrank | ignore | add | rwdb_rd_id | orig_fid | prefix | uident | continent | expressway | level | min_zoom | min_label | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | Road | Secondary Highway | CAN | None | Version 1.5: Changed alignment, a few adds in ... | None | None | None | None | 0 | 3 | 0 | ne_1d4_original | None | None | None | None | None | 0 | 0 | 0 | 0 | 0 | None | 314705 | North America | 0 | None | 7.1 | 9.6 | LINESTRING (-133.32533 62.21571, -133.31664 62... |
1 | 7 | Road | Secondary Highway | USA | None | Version 1.5: Changed alignment, a few adds in ... | 83 | None | None | None | 0 | 164 | 0 | ne_1d4_original | None | None | None | None | None | 0 | 0 | 0 | 0 | 0 | None | 108105 | North America | 0 | Federal | 7.0 | 8.6 | LINESTRING (-100.50543 42.80753, -100.53495 42... |
2 | 7 | Road | Secondary Highway | USA | None | Version 1.5: Changed alignment, a few adds in ... | 840 | None | None | None | 0 | 98 | 0 | ne_1d4_original | None | None | None | None | None | 0 | 0 | 0 | 0 | 0 | None | 0 | North America | 0 | U/C | 7.0 | 9.5 | LINESTRING (-87.27432 36.02439, -87.22916 35.9... |
roads.plot()
<Axes: >
16.5.2 One-liner clipping
You may have already noticed that roads
is not in the same CRS as alaska
, so these geo-datasets shound’t interact until they’re in the same CRS. Before jumping right into reprojecting and clipping, we will subset the data to select only US roads:
# select US roads only
= roads[roads.sov_a3 == 'USA']
usa_roads usa_roads.plot()
<Axes: >
Geospatial operations are usually costly. The more detailed our geometries the longer in takes to do geospatial computations. It’s a good practice to try to reduce your data as much as possible before applying any geospatial transformation.
We will now do a “one-liner” to clip usa_roads
using the alaska
polygon. Notice we are using the ouput of usa_roads.to_crs(alaska.crs)
and thus not changing the usa_roads
geo-dataframe or creating new variables:
# clip usa_roads to alaska geometry
= gpd.clip(usa_roads.to_crs(alaska.crs),alaska)
ak_roads
ak_roads.plot()
<Axes: >
Notice how the lines break on the small islands? However in the usa_roads
there are no broken lines. This should make us suspect we are leaving data out and clipping exactly to the polygons in alaska
is not quite what we want.
16.5.3 Clipping with bounding box
We will clip the usa_roads
geo-dataframe with the bounding box of alaska
instead of its polygons. To create a bounding box, we first use the box()
function we imported from shapely.geometry
. The syntax for box()
is:
box(minx, miny, maxx, maxy)
the output is a X representing a box constructed like this:
If we want to create a shapely polygon from the bounds of a geo-dataframe gdf
, a more straightforward syntax is:
*gdf.total_bounds) box(
In our case:
= box(*alaska.total_bounds)
bbox print(type(bbox))
bbox
<class 'shapely.geometry.polygon.Polygon'>
*
as unpacking operator
In the last syntax we are using the asterisk *
as an unpacking operator on the array gdf.total_bounds
. Think about it as unpacking the elements of gdf.total_bounds
and assigning them one-by-one to the paremeters minx, miny, maxx, maxy
of box()
.
This is a good article explaining more about unpacking with *
in Python: https://geekflare.com/python-unpacking-operators/
# create geo-dataframe from bounding box
= gpd.GeoDataFrame(geometry = [bbox], # assign geometry column
ak_bbox = alaska.crs) # assign CRS
crs print(type(ak_bbox))
ak_bbox
<class 'geopandas.geodataframe.GeoDataFrame'>
geometry | |
---|---|
0 | POLYGON ((1493082.309 404545.108, 1493082.309 ... |
We can now clip the roads using Alaska’s bounding box:
= gpd.clip(usa_roads.to_crs(ak_bbox.crs), ak_bbox)
ak_complete_roads ak_complete_roads.plot()
<Axes: >
Notice the difference between the two clipping methods:
# two rows, one column
= plt.subplots(2, 1, figsize=(10,10))
fig, (ax1, ax2)
=ax1)
ak_roads.plot(ax'Roads clipped with AK multipolygon')
ax1.set_title(
=ax2)
ak_complete_roads.plot(ax'Roads clipped with AK bounding box')
ax2.set_title(
#plt.axis('equal')
plt.show()
16.6 Plot
Finally, we can put all our data together in the same map:
# https://matplotlib.org/stable/api/markers_api.html
# Trouble: not in the same CRS
# this is cool! but now we are seeing all Arctic comms
= plt.subplots(figsize=(12,8))
fig, ax # --------------------------
'off')
ax.axis(
=ax, color='none', edgecolor='0.7')
alaska.plot(ax=ax, column='type', legend=True)
ak_complete_roads.plot(ax#ak_comms.plot(ax=ax, color='red')
=ax, color='red', marker='s')
ak_places.plot(ax
# --------------------------
plt.show()
/Users/galaz-garcia/anaconda3/envs/mpc-env/lib/python3.11/site-packages/geopandas/plotting.py:732: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(values.dtype):