6  Plotting

Let’s learn the basics of plotting with pandas to make things more interesting.

To get us started, we will use again the simplified data (glacial_loss.csv) from the National Snow and Ice Data Center (Original dataset). The column descriptions are:

import pandas as pd

# read in file
df = pd.read_csv('data/lesson-1/glacial_loss.csv')

# see the first five rows
df.head()
year europe arctic alaska asia north_america south_america antarctica global_glacial_volume_change annual_sea_level_rise cumulative_sea_level_rise
0 1961 -5.128903 -108.382987 -18.721190 -32.350759 -14.359007 -4.739367 -35.116389 -220.823515 0.610010 0.610010
1 1962 5.576282 -173.252450 -24.324790 -4.675440 -2.161842 -13.694367 -78.222887 -514.269862 0.810625 1.420635
2 1963 -10.123105 -0.423751 -2.047567 -3.027298 -27.535881 3.419633 3.765109 -550.575640 0.100292 1.520927
3 1964 -4.508358 20.070148 0.477800 -18.675385 -2.248286 20.732633 14.853096 -519.589859 -0.085596 1.435331
4 1965 10.629385 43.695389 -0.115332 -18.414602 -19.398765 6.862102 22.793484 -473.112003 -0.128392 1.306939

6.1 plot() method

A pandas.DataFrame has a built-in method plot() for plotting. When we call it without specifying any other parameters plot() creates one line plot for each of the columns with numeric data.

# one line plot per column with numeric data - a mess
df.plot()
<AxesSubplot:>

As we can see, this doesn’t make any sense! In particular, look at the x-axis. The default for plot is to use the values of the index as the x-axis values. Let’s see some examples about how to improve this situation.

6.2 Line plots

We can make a line plot of one column against another by using the following syntax:

df.plot(x='x_values_column', y='y_values_column')

For example,

# change in glacial volume per year in Europe
df.plot(x='year', y='europe')
<AxesSubplot:xlabel='year'>

We can do some basic customization specifying other arguments of the plot function. Some basic ones are:

  • title: Title to use for the plot.
  • xlabel: Name to use for the xlabel on x-axis
  • ylabel: Name to use for the ylabel on y-axis
  • color: change the color of our plot

In action:

df.plot(x='year', 
        y='europe',
        title='Change in glacial volume per year in Europe',
        xlabel='Year',
        ylabel='​Change in glacial volume (km3​)',
        color='green'
        )
<AxesSubplot:title={'center':'Change in glacial volume per year in Europe'}, xlabel='Year', ylabel='\u200bChange in glacial volume (km3\u200b)'>

You can see all the optional arguments for the plot() function in the documentation.

6.2.1 Multiple line plots

Let’s say we want to graph the change in glacial volume in the Arctic and Alaska. We can do it by updating these arguments:

  • y : a list of column names that will be plotted against x
  • color: specify the color of each column’s line with a dictionary {'col_1' : 'color_1', 'col_2':'color_2}
df.plot(x='year', 
        y=['arctic', 'alaska'],
        title = 'Change in glacial volume per year in Alaska and the Arctic',
        xlabel='Year',
        ylabel='​Change in glacial volume (km3​)',        
        color = {'arctic':'#F48FB1',
                 'alaska': '#AB47BC'
                 }
        )
<AxesSubplot:title={'center':'Change in glacial volume per year in Alaska and the Arctic'}, xlabel='Year', ylabel='\u200bChange in glacial volume (km3\u200b)'>

Notice that for specifying the colors we used a HEX code, this gives us more control over how our graph looks.

We can also create separate plots for each column by setting the subset to True.

df.plot(x='year', 
        y=['arctic', 'alaska'],
        title = 'Change in glacial volume per year in Alaska and the Arctic',
        xlabel='Year',
        ylabel='​Change in glacial volume (km3​)',        
        color = {'arctic':'#F48FB1',
                 'alaska': '#AB47BC'
                 },
        subplots=True
        )
array([<AxesSubplot:xlabel='Year', ylabel='\u200bChange in glacial volume (km3\u200b)'>,
       <AxesSubplot:xlabel='Year', ylabel='\u200bChange in glacial volume (km3\u200b)'>],
      dtype=object)

6.2.2 Check-in

  1. Plot a graph of the annual sea level rise with respect to the years.

  2. What information is the columns variable retrieving from the data frame? Describe in a sentence what is being plotted.

columns = df.loc[:,'europe':'antarctica'].columns
df.plot(x='year', 
        y=columns, 
        subplots=True)

We will move on to another dataset for the rest of the lecture. The great…

6.3 Palmer penguins dataset

For the next plots we will use the Palmer Penguins dataset (Horst et al., 2020). This contains size measurements for three penguin species in the Palmer Archipelago, Antarctica.

The Palmer Archipelago penguins. Artwork by @allison_horst.

The data is usually accessed through the palmerpenguins R data package. Today we will access the csv directly into Python using the URL: https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv

The Palmer penguins dataset has the following columns:

  • species
  • island
  • bill_length_mm
  • bill_depth_mm
  • flipper_lenght_mm
  • body_mass_g
  • sex
  • year

Let’s start by reading in the data.

# read in data
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv')

# look at dataframe's head
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
# check column data types and NA values
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
# simple statistics about numeric columns
penguins.describe()
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
count 342.000000 342.000000 342.000000 342.000000 344.000000
mean 43.921930 17.151170 200.915205 4201.754386 2008.029070
std 5.459584 1.974793 14.061714 801.954536 0.818356
min 32.100000 13.100000 172.000000 2700.000000 2007.000000
25% 39.225000 15.600000 190.000000 3550.000000 2007.000000
50% 44.450000 17.300000 197.000000 4050.000000 2008.000000
75% 48.500000 18.700000 213.000000 4750.000000 2009.000000
max 59.600000 21.500000 231.000000 6300.000000 2009.000000

We can also subset the dataframe to get information about a particular column or groups of columns.

# get count unique values in categorical columns and year
penguins[['species', 'island', 'sex', 'year']].nunique()
species    3
island     3
sex        2
year       3
dtype: int64
# get unique values in species column
print(penguins.species.unique())
['Adelie' 'Gentoo' 'Chinstrap']
# species unique value counts 
print(penguins.species.value_counts())
Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

6.4 kind argument in plot()

We talked about how the plot() function creates by default a line plot. The parameter that controls this behaviour is plot()’s kind parameter. By changing the value of kind we can create different kinds of plots. Let’s look at the documentation to see what these values are:

pandas.DataFrame.plot documentation extract - accessed Oct 10,2023

Notice the default value of kind is 'line'.

Let’s change the kind parameter to create some different plots.

6.5 Scatter plots

Suppose we want to visualy compare the flipper length against the body mass, we can do this with a scatterplot.

Example:

penguins.plot(kind='scatter',
        x='flipper_length_mm', 
        y='body_mass_g')
<AxesSubplot:xlabel='flipper_length_mm', ylabel='body_mass_g'>

We can update some other arguments to customize the graph:

penguins.plot(kind='scatter',
        x='flipper_length_mm', 
        y='body_mass_g',
        title='Flipper length and body mass for Palmer penguins',
        xlabel='Flipper length (mm)',
        ylabel='Body mass (g)',
        color='#ff3b01',
        alpha=0.4  # controls transparency
        )
<AxesSubplot:title={'center':'Flipper length and body mass for Palmer penguins'}, xlabel='Flipper length (mm)', ylabel='Body mass (g)'>

6.6 Bar plots

We can create bar plots of our data setting kind='bar' in the plot() method.

For example, let’s say we want to get data about the 10 penguins with lowest body mass. We can first select this data using the nsmallest() method for series:

smallest = penguins.body_mass_g.nsmallest(10).sort_values()
smallest
314    2700.0
58     2850.0
64     2850.0
54     2900.0
98     2900.0
116    2900.0
298    2900.0
104    2925.0
47     2975.0
44     3000.0
Name: body_mass_g, dtype: float64

We can then plot this data as a bar plot

smallest.plot(kind='bar')
<AxesSubplot:>

If we wanted to look at other data for these smallest penguins we can use the index of the smallest pandas.Series to select those rows in the original penguins data frame using loc:

penguins.loc[smallest.index]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
314 Chinstrap Dream 46.9 16.6 192.0 2700.0 female 2008
58 Adelie Biscoe 36.5 16.6 181.0 2850.0 female 2008
64 Adelie Biscoe 36.4 17.1 184.0 2850.0 female 2008
54 Adelie Biscoe 34.5 18.1 187.0 2900.0 female 2008
98 Adelie Dream 33.1 16.1 178.0 2900.0 female 2008
116 Adelie Torgersen 38.6 17.0 188.0 2900.0 female 2009
298 Chinstrap Dream 43.2 16.6 187.0 2900.0 female 2007
104 Adelie Biscoe 37.9 18.6 193.0 2925.0 female 2009
47 Adelie Dream 37.5 18.9 179.0 2975.0 NaN 2007
44 Adelie Dream 37.0 16.9 185.0 3000.0 female 2007

6.7 Histograms

We can create a histogram of our data setting kind='hist' in plot().

# using plot without subsetting data - a mess again
penguins.plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>

To gain actual information, let’s subset the data before plotting it. For example, suppose we want to look at the distribution of flipper length. We could do it in this way:

# distribution of flipper length measurements
# first select data, then plot
penguins.flipper_length_mm.plot(kind='hist',
                                title='Penguin flipper lengths',
                                xlabel='Flipper length (mm)',
                                grid=True)
<AxesSubplot:title={'center':'Penguin flipper lengths'}, ylabel='Frequency'>

6.7.1 Check-in

  1. Select the bill_length_mm and bill_depth_mm columns in the penguins dataframe and then update the kind parameter to box to make boxplots of the bill length and bill depth.

  2. Select both rows and columns to create a histogram of the flipper length of gentoo penguins.

6.8 References

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218.