import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
15 csv to GeoDataFrame
In this lesson we will learn how to make a csv into a geopandas.GeoDataFrame
, setting its CRS manually and learn some more customizations for maps and matplotlib figures.
15.1 Data
We will use two datasets in this lesson. The first one is a reprojection of this dataset from the U.S. Energy Information Administration (EIA) with information about operable electric generating plants in the United States by energy source, as of May 2023.
Follow these steps to download the reprojected datset for this lesson:
- Go to https://github.com/carmengg/eds-220-book/blob/main/data/power_plants_epsg4269.csv
- Download the raw file and move it into your working directory
You can access the metadata for this dataset here.
The second dataset is a TIGER shapefile from the United States Census Bureau. TIGER stands for Topologically Integrated Geographic Encoding and Referencing. This used to be the data format the US Census distributed geospatial data, but since 2008 TIGER files are converted to shapefiles. We will use the shapefiles for the US states. Follow these steps to download shapefile with the United States’ states:
- At the bottom of the 2022 page, under Download, click on “Web Interface”
- For year, select 2022, and for layer type select “States (and equivalent)”. Click submit.
- Click on “Download national file”.
You can check the metadata for all the TIGER shapefiles here. The columns for this shapefile are:
File management: Both datasets must be in a data directory inside your working directory.
15.2 DataFrame
to GeoDataFrame
Let’s start by importing the necessary libraries:
And update pandas display options:
# display all column when looking at dataframes
"display.max.columns", None) pd.set_option(
Our first step is to import the power plants dataset. Notice this is a csv, geopandas doesn’t have a way to extract a geometry column from a csv, so we will need to create this geometry manually. We start by reading in the data using pandas
.
# import power plants data
= pd.read_csv('data/power_plants_epsg4269.csv')
power_plants 3) power_plants.head(
Unnamed: 0 | objectid | plant_code | plant_name | utility_id | utility_name | sector_name | street_address | city | county | state | zip | primsource | source_desc | tech_desc | install_mw | total_mw | bat_mw | bio_mw | coal_mw | geo_mw | hydro_mw | hydrops_mw | ng_mw | nuclear_mw | crude_mw | solar_mw | wind_mw | other_mw | source | period | longitude | latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 11570 | 1 | Sand Point | 63560 | TDX Sand Point Generating, LLC | Electric Utility | 100 Power Plant Way | Sand Point | Aleutians East | Alaska | 99661.0 | petroleum | Petroleum = 1.3 MW, Wind = 0.4 MW | Petroleum Liquids; Onshore Wind Turbine; | 3.7 | 1.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.3 | NaN | 0.4 | NaN | EIA-860, EIA-860M and EIA-923 | 202305.0 | -160.497222 | 55.339722 |
1 | 1.0 | 11571 | 2 | Bankhead Dam | 195 | Alabama Power Co | Electric Utility | 19001 Lock 17 Road | Northport | Tuscaloosa | Alabama | 35476.0 | hydroelectric | Hydroelectric = 53 MW | Conventional Hydroelectric | 53.9 | 53.0 | NaN | NaN | NaN | NaN | 53.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | EIA-860, EIA-860M and EIA-923 | 202305.0 | -87.356823 | 33.458665 |
2 | 2.0 | 11572 | 3 | Barry | 195 | Alabama Power Co | Electric Utility | North Highway 43 | Bucks | Mobile | Alabama | 36512.0 | natural gas | Coal = 1118.5 MW, Natural Gas = 1296.2 MW | Conventional Steam Coal; Natural Gas Fired Com... | 2569.5 | 2414.7 | NaN | NaN | 1118.5 | NaN | NaN | NaN | 1296.2 | NaN | NaN | NaN | NaN | NaN | EIA-860, EIA-860M and EIA-923 | 202305.0 | -88.010300 | 31.006900 |
# update column names to small caps
= power_plants.columns.str.lower() power_plants.columns
This csv has longitude and latitude columns, which indicate the location of the power plants in the NAD83 CRS (EPSG:4269). We use this information to create a new gpd.GeoDataFrame
from the pd.DataFrame
using the GeoPandas function points_from_xy()
like this:
= gpd.GeoDataFrame(power_plants, # data
power_plants # specify geometry column
=gpd.points_from_xy(power_plants.longitude,
geometry
power_plants.latitude),# specify CRS
='EPSG:4269'
crs )
Check we now have a geometry column:
3) power_plants.head(
unnamed: 0 | objectid | plant_code | plant_name | utility_id | utility_name | sector_name | street_address | city | county | state | zip | primsource | source_desc | tech_desc | install_mw | total_mw | bat_mw | bio_mw | coal_mw | geo_mw | hydro_mw | hydrops_mw | ng_mw | nuclear_mw | crude_mw | solar_mw | wind_mw | other_mw | source | period | longitude | latitude | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 11570 | 1 | Sand Point | 63560 | TDX Sand Point Generating, LLC | Electric Utility | 100 Power Plant Way | Sand Point | Aleutians East | Alaska | 99661.0 | petroleum | Petroleum = 1.3 MW, Wind = 0.4 MW | Petroleum Liquids; Onshore Wind Turbine; | 3.7 | 1.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.3 | NaN | 0.4 | NaN | EIA-860, EIA-860M and EIA-923 | 202305.0 | -160.497222 | 55.339722 | POINT (-160.49722 55.33972) |
1 | 1.0 | 11571 | 2 | Bankhead Dam | 195 | Alabama Power Co | Electric Utility | 19001 Lock 17 Road | Northport | Tuscaloosa | Alabama | 35476.0 | hydroelectric | Hydroelectric = 53 MW | Conventional Hydroelectric | 53.9 | 53.0 | NaN | NaN | NaN | NaN | 53.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | EIA-860, EIA-860M and EIA-923 | 202305.0 | -87.356823 | 33.458665 | POINT (-87.35682 33.45867) |
2 | 2.0 | 11572 | 3 | Barry | 195 | Alabama Power Co | Electric Utility | North Highway 43 | Bucks | Mobile | Alabama | 36512.0 | natural gas | Coal = 1118.5 MW, Natural Gas = 1296.2 MW | Conventional Steam Coal; Natural Gas Fired Com... | 2569.5 | 2414.7 | NaN | NaN | 1118.5 | NaN | NaN | NaN | 1296.2 | NaN | NaN | NaN | NaN | NaN | EIA-860, EIA-860M and EIA-923 | 202305.0 | -88.010300 | 31.006900 | POINT (-88.01030 31.00690) |
Let’s see some information about the CRS of our power plants dataset:
# print information about the
print('is geographic?: ', power_plants.crs.is_geographic)
print('is projected?: ', power_plants.crs.is_projected)
print('datum: ', power_plants.crs.datum)
print('ellipsoid: ', power_plants.crs.ellipsoid)
power_plants.crs
is geographic?: True
is projected?: False
datum: North American Datum 1983
ellipsoid: GRS 1980
<Geographic 2D CRS: EPSG:4269>
Name: NAD83
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands.
- bounds: (167.65, 14.92, -47.74, 86.46)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
Now that we have a geometry column and a CRS, we can plot our dataset:
power_plants.plot()
<AxesSubplot:>
15.3 TIGER shapefile
Next, we import the TIGER shapefile:
# read-in data
= gpd.read_file('data/tl_2022_us_state/tl_2022_us_state.shp')
states # update column names to small caps
= states.columns.str.lower()
states.columns
states.head()
region | division | statefp | statens | geoid | stusps | name | lsad | mtfcc | funcstat | aland | awater | intptlat | intptlon | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 5 | 54 | 01779805 | 54 | WV | West Virginia | 00 | G4000 | A | 62266456923 | 489045863 | +38.6472854 | -080.6183274 | POLYGON ((-77.75438 39.33346, -77.75422 39.333... |
1 | 3 | 5 | 12 | 00294478 | 12 | FL | Florida | 00 | G4000 | A | 138962819934 | 45971472526 | +28.3989775 | -082.5143005 | MULTIPOLYGON (((-83.10874 24.62949, -83.10711 ... |
2 | 2 | 3 | 17 | 01779784 | 17 | IL | Illinois | 00 | G4000 | A | 143778515726 | 6216539665 | +40.1028754 | -089.1526108 | POLYGON ((-87.89243 38.28285, -87.89334 38.282... |
3 | 2 | 4 | 27 | 00662849 | 27 | MN | Minnesota | 00 | G4000 | A | 206244837557 | 18937184315 | +46.3159573 | -094.1996043 | POLYGON ((-95.31989 48.99892, -95.31747 48.998... |
4 | 3 | 5 | 24 | 01714934 | 24 | MD | Maryland | 00 | G4000 | A | 25151771744 | 6979295311 | +38.9466584 | -076.6744939 | POLYGON ((-75.75600 39.24607, -75.75579 39.243... |
Let’s see some information about the CRS of our states geodataframe:
# print information about the CRS
states.crs
<Geographic 2D CRS: EPSG:4269>
Name: NAD83
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: North America - onshore and offshore: Canada - Alberta; British Columbia; Manitoba; New Brunswick; Newfoundland and Labrador; Northwest Territories; Nova Scotia; Nunavut; Ontario; Prince Edward Island; Quebec; Saskatchewan; Yukon. Puerto Rico. United States (USA) - Alabama; Alaska; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Hawaii; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming. US Virgin Islands. British Virgin Islands.
- bounds: (167.65, 14.92, -47.74, 86.46)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
And plot it:
states.plot()
<AxesSubplot:>
Notice the map covers a big extension, this is because, according to the TIGER shapefiles metadata:
In addition to the fifty states, the Census Bureau treats the District of Columbia, Puerto Rico, and the Island areas (American Samoa, the Commonwealth of the Northern Mariana Islands, Guam, and the U.S. Virgin Islands) as statistical equivalents of states for the purpose of data presentation.
In this US Census Bureau file we can see what each code for the region, division, and state corresponds to. These should be numeric codes, so we can start by updating the corresponding columns in the states geo-dataframe:
# notice region, division, and statefp are strings (object) types
states.dtypes
region object
division object
statefp object
statens object
geoid object
stusps object
name object
lsad object
mtfcc object
funcstat object
aland int64
awater int64
intptlat object
intptlon object
geometry geometry
dtype: object
# update dtypes of code columns
= states.region.astype('int')
states.region = states.division.astype('int')
states.division = states.statefp.astype('int') states.statefp
States corresponds to regions 1 through 4. However, there’s also a region code 9. These rows correspond to non-state regions:
print(states.region.unique())
==9] states[states.region
[3 2 1 4 9]
region | division | statefp | statens | geoid | stusps | name | lsad | mtfcc | funcstat | aland | awater | intptlat | intptlon | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 9 | 0 | 78 | 01802710 | 78 | VI | United States Virgin Islands | 00 | G4000 | A | 348021909 | 1550236187 | +18.3392359 | -064.9500433 | MULTIPOLYGON (((-64.76834 18.26033, -64.77074 ... |
35 | 9 | 0 | 69 | 01779809 | 69 | MP | Commonwealth of the Northern Mariana Islands | 00 | G4000 | A | 472292521 | 4644252458 | +15.0010865 | +145.6181702 | MULTIPOLYGON (((145.05897 14.12500, 145.06302 ... |
36 | 9 | 0 | 66 | 01802705 | 66 | GU | Guam | 00 | G4000 | A | 543555849 | 934337453 | +13.4417451 | +144.7719021 | POLYGON ((144.56343 13.44806, 144.56357 13.450... |
41 | 9 | 0 | 60 | 01802701 | 60 | AS | American Samoa | 00 | G4000 | A | 197759069 | 1307243751 | -14.2671590 | -170.6682674 | MULTIPOLYGON (((-170.53809 -14.33613, -170.548... |
49 | 9 | 0 | 72 | 01779808 | 72 | PR | Puerto Rico | 00 | G4000 | A | 8869029522 | 4922249087 | +18.2176480 | -066.4107992 | MULTIPOLYGON (((-66.32322 17.87767, -66.33170 ... |
We can check that Alaska and the non-state regions are causing the long map:
# plot data in states that is not Alaska (code 2) and doesn't have region code 9
!=2) & (states.region!=9)].plot() states[(states.statefp
<AxesSubplot:>
15.4 Data selection
For the pupose of this exercise, we want to keep only data for the contiguous states. Let’s overwrite the geo-dataframes accordingly:
= states[(states.region!=9) & (~states.statefp.isin([2,15]))]
states = power_plants[~power_plants.state.isin(['Puerto Rico','Hawaii','Alaska'])] power_plants
~
= not
In the previous code we used the syntax
~df.column.isin([val1, val2, val3])
The ~
tilde symbol is used in Python to negate a statement. So the previous line could be read as “the values in df’s column which are not in the list [val1, val2, val3].”
15.5 Plotting
Before we plot our data, let’s make sure they are in the same CRS:
== power_plants.crs states.crs
True
We can now try plotting both datasets together. To color the power_plants dots according to color we just need
= plt.subplots()
fig, ax
# add states
=ax,
states.plot(ax='none',
color= 'slategray')
edgecolor
# add electric power plants colored by energy source
=ax,
power_plants.plot(ax='primsource', # color points according to primsource value
column=True, # add legend
legend= 4, # adjust point size
markersize ='tab20', # this color map has 20 different colors
cmap=0.5)
alpha
plt.show()
And we can finish by adding more information to contextualize our map:
# figsize updates the figure size
= plt.subplots(figsize=(10, 6))
fig, ax
# --------------------------
# remove the axis box around the map
'off')
ax.axis(
# update title
'Operable electric generating plants in the contiguous United States',
ax.set_title(=20)
fontsize
# annotate the data source
"Data: U.S. Energy Information Administration (EIA), accessed Oct 30, 2023 \nhttps://atlas.eia.gov/datasets/eia::power-plants/about",
ax.annotate(=(0.25, .06), # position
xy='figure fraction',
xycoords=10,
fontsize='#555555')
color
# --------------------------
# add states
=ax,
states.plot(ax='none',
color= '#362312')
edgecolor
# add electric power plants colored by energy source
=ax,
power_plants.plot(ax='primsource',
column=True,
legend= 4,
markersize ='tab20',
cmap=0.5,
alpha# adjust legend
={'loc': "lower right",
legend_kwds'title':'Primary energy source',
'title_fontsize':'small',
'fontsize':'small'})
<AxesSubplot:title={'center':'Operable electric generating plants in the contiguous United States'}>