8 `groupby()`

In this section we will go over the groupby() function and the split-apply-combine strategy. We will use the Palmer penguins data.

import pandas as pd 

# read in Palmer penguins data
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

8.1 Summary statistics

In pandas it is easy to get summary statistics for each column in a dataframe by using methods such as

sum(): sum values in each column,
count(): count non-NA values in each column,
min() and max(): get the minimum and maximum value in each column,
mean() and median(): get the mean and median value in each column,
std() and var(): get the standard deviation and variance in each column.

Example

# get the number of non-NA values in each column 
penguins.count()

species              344
island               344
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
sex                  333
year                 344
dtype: int64

# get the minimum value in each column
penguins.select_dtypes('number').min()

bill_length_mm         32.1
bill_depth_mm          13.1
flipper_length_mm     172.0
body_mass_g          2700.0
year                 2007.0
dtype: float64

8.2 Grouping

Our penguins data is naturally split into different groups: there are three different species, two sexes and three islands. Often, we want to calculate a certain statistic for each group. For example, suppose we want to calculate the average flipper length per species. How would we do this “by hand”?

We start with our data and notice there are multiple species in the species column.
We split our original table to group all observations from the same species together.
We calculate the average flipper length for each of the groups we formed.
Then we combine the values for average flipper length per species into a single table.

This is known as the Split-Apply-Combine strategy. This strategy follows the three steps we explained above:

Split: Split the data into logical groups (e.g. species, sex, island, etc.)
Apply: Calculate some summary statistic on each group (e.g. average flipper length by species, number of individuals per island, body mass by sex, etc.)
Combine: Combine the statistic calculated on each group back together.

Split-apply-combine to calculate mean flipper length

In Python we can use the groupby() method to split (i.e. group) the data into different categories. The general syntax for groupby() is

df.groupby(columns_to_group_by)

where most often we will have columns_to_group_by = a single column name (string) or a list of column names, the unique values of the column (or columns) will be used as the groups of the data frame

Example

First, if we don’t use the groupby() method, we obtain the average for the whole flipper length column:

penguins.flipper_length_mm.mean()

200.91520467836258

To get the mean flipper length by species we first group our dataset by the species column’s data:

# average flipper length per species
penguins.groupby("species").flipper_length_mm.mean()

species
Adelie       189.953642
Chinstrap    195.823529
Gentoo       217.186992
Name: flipper_length_mm, dtype: float64

There’s a lot going on there, let’s break it down (remember the . can be read as “and then…”)

start with the penguins data frame, and then…
use groupby() to group the data frame by species values, and then…
select the 'flipper_length_mm' column, and then…
calculate the mean()

We can store our new data frame as avg_flipper and then graph it as a bar plot:

# average flipper length per species
avg_flipper = penguins.groupby("species").flipper_length_mm.mean()

# sort in descending order
# remember to reassign to actually update data
avg_flipper = avg_flipper.sort_values(ascending=False)

# plot as a bar plot
avg_flipper.plot(kind='bar',
                 title='Average flipper length by species',
                 ylabel='Flipper length (mm)',
                 rot=360  # rotation for ticks 
                 )

<AxesSubplot:title={'center':'Average flipper length by species'}, xlabel='species', ylabel='Flipper length (mm)'>

We can also group by combinations of columns.

Example

Suppose we want to know what was the number of penguins in every island on different years. We can use the count() method to count the number of non-NA values in each column like this:

penguins.count()

species              344
island               344
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
sex                  333
year                 344
dtype: int64

When we group by island and year we get the count of non-NA values for each column divided for each combination of island and year:

penguins.groupby(['island','year']).count()

		species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
island	year
Biscoe	2007	44	44	44	44	44	43
	2008	64	64	64	64	64	63
	2009	60	59	59	59	59	57
Dream	2007	46	46	46	46	46	45
	2008	34	34	34	34	34	34
	2009	44	44	44	44	44	44
Torgersen	2007	20	19	19	19	19	15
	2008	16	16	16	16	16	16
	2009	16	16	16	16	16	16

Let’s say we want to plot the surveyed population per year and island. We could then use a “one liner” to do it in this way:

penguins.groupby(['island','year']).count().species.sort_values().plot(kind='barh')

<AxesSubplot:ylabel='island,year'>

8.2.1 Check-in

Use groupby and the max() method for data frames to calculate the maximum value of penguin’s body mass by year and species.

Use (1) to display the highest body masses per year and species as a bar plot in descending order.

8.3 Acknowledgmenets

This section is based on this NCEAS Learning Hub R lesson.

Halina Do-Linh, Carmen Galaz García, Matthew B. Jones, Camila Vargas Poulsen. 2023. Open Science Synthesis training Week 1. NCEAS Learning Hub & Delta Stewardship Council.