7  Updating data frames

In this lesson we will introduce methods for updating a pandas data frame, these include adding and removing columns and updating specific values.

7.1 Data: Palmer penguins

We will use the Palmer penguins dataset (Horst et al., 2020). This time we will import it via the seaborn package since it is included as one of seaborn’s example datasets.

seaborn

seaborn is a library to make nice looking graphs in Python. We’ve been using plot() to make basic exploratory graphs. seaborn is the next step if you’re interested in making publication-level statistical graphics.

# standard libraries
import pandas as pd
import numpy as np

# importing seaborn with its standard abbreviation
import seaborn as sns

# will use the random library to create some random numbers
import random

penguins = sns.load_dataset("penguins")

# look at dataframe's head
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

7.2 Adding a single column

The general syntax to adding a new column to a pd.DataFrame is

df['new_col_name'] = new_column_values

where new_column values could be:

  • a pd.Series or numpy.array of the same length as the data frame
  • a single scalar

Example

Suppose we want to create a new column where the body mass is in kilograms instead of grams, so we need to divide the body_mass_g by 1000.

# add a new column body_mass_kg 
# sane syntax as adding a new key to a dictionary
penguins['body_mass_kg'] = penguins.body_mass_g/1000

# confirm the new column is in the data frame
print('body_mass_kg' in penguins.columns)

# take a look at the new column
penguins.head()
True
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex body_mass_kg
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male 3.75
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female 3.80
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female 3.25
3 Adelie Torgersen NaN NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female 3.45
pd.DataFrames and dictionaries

Remember we talked about how a pandas.DataFrame could be seen as a dictionary of its columns. The most basic way of adding a new column to a data frame is the same way of adding a new key-value pair to a dictionary:

# adding a new key-value pair to a dictionary
# same as adding a new column in pandas
dict[new_key] = new_value

Notice the new column got added at the end. If we want to create a new column and insert it at a particular position we use the data frame method insert():

df.insert(loc = integer_index,  # location of new column
          column = 'new_col_name', 
          value = new_col_values)

Example

Suppose each penguin observation gets a unique identifier as a three digit number. We want to add this column, at the beginning of the data frame. We can use insert to do this:

# create random 3-digit codes
# random.sample used for random sampling wo replacement
codes = random.sample(range(100,1000), len(penguins))

# insert codes at the front of data frame = index 0
penguins.insert(loc=0, 
                column = 'code',
                value = codes)
        
penguins.head()
code species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex body_mass_kg
0 330 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male 3.75
1 128 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female 3.80
2 530 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female 3.25
3 796 Adelie Torgersen NaN NaN NaN NaN NaN NaN
4 692 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female 3.45

7.3 Adding multiple columns

We can assign multiple columns in the same call by using the data frame’s assign() method. The general syntax is:

df = df.assign( new_col1_name = new_col1_values, 
           new_col2_name = new_col2_values)

Notice the new column names are not strings, we declare them as if we were creating variables.

Example

Suppose we want to add these new columns:

  • flipper length converted from mm to cm, and
  • a code representing the observer.

We can add these columns to penguins using assign():

# create new columns in the data frame
# random.choices used for random sampling with replacement
# need to reassign output of assign() to update the data frame
penguins = penguins.assign( flipper_length_cm = penguins.flipper_length_mm /10, 
                            observer =   random.choices(['A','B','C'], k=len(penguins)))
# look at result
penguins.head()
code species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex body_mass_kg flipper_length_cm observer
0 330 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male 3.75 18.1 C
1 128 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female 3.80 18.6 C
2 530 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female 3.25 19.5 B
3 796 Adelie Torgersen NaN NaN NaN NaN NaN NaN NaN A
4 692 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female 3.45 19.3 A

7.4 Removing columns

We can remove columns using the drop() method for data frames, the syntax is:

df = df.drop(columns = col_names)

where col_names can be a single column name (str) or a list of column names. The drop() method does not modify the data frame in place, so you need to reassign the output.

Example

Now that we updated the units for flipper length and body mass, it makes sense to remove the previous columns to avoid duplicate information. We can do this using drop():

# use a list of column names
# reassign output of drop() to dataframe to update it
penguins = penguins.drop(columns=['flipper_length_mm','body_mass_g'])

# check columns
print(penguins.columns)
Index(['code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'sex',
       'body_mass_kg', 'flipper_length_cm', 'observer'],
      dtype='object')

7.5 Updating values

Sometimes we want to update certain value within our data frame. We’ll review some methods and best practices to do that in this section.

7.5.1 A single value

We can access a single value in a pd.DataFrame using the locators

  • at[] to select by labels, or
  • iat[] to select by position.

The syntax for at[] is:

df.at[single_index_value, 'column_name']

Think of at[] as the equivalent to loc[] when trying to access a single value.

Example

Let’s say we want to know what was the bill length of the penguin in the fourth row. We can access that using at[]:

# access value at row with index=3 and column='bill_length_mm'
penguins.at[3,'bill_length_mm']
nan

We get this is an NA. Maybe we want to update it to 38.3 mm. We can do this with at[] too:

# update NA to 38.3
penguins.at[3,'bill_length_mm'] = 38.3

# check it was updated
penguins.head()
code species island bill_length_mm bill_depth_mm sex body_mass_kg flipper_length_cm observer
0 330 Adelie Torgersen 39.1 18.7 Male 3.75 18.1 C
1 128 Adelie Torgersen 39.5 17.4 Female 3.80 18.6 C
2 530 Adelie Torgersen 40.3 18.0 Female 3.25 19.5 B
3 796 Adelie Torgersen 38.3 NaN NaN NaN NaN A
4 692 Adelie Torgersen 36.7 19.3 Female 3.45 19.3 A

If we want to access or update a single value by position we use the iat[] locator.

7.5.2 Multiple values in a column

What if we want to update multiple values in a column? We’ll cover two cases: with a condition on the column values and by selecting a few values to update.

7.5.2.1 By condition

Often we want to create a new column where the new values depend on conditions on another column’s values.

Example

Suppose we want to classify all penguins with body mass less than 3kg as small, penguins with body mass greater or equal than 3kg but less than 5kg as medium, and those with body mass greater or equal than 5kg as large. One way to add this information in a new column using numpy.select():

# create a list with the conditions
conditions = [penguins.body_mass_kg < 3, 
              (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
              5 <= penguins.body_mass_kg]

# create a list with the choices
choices = ["small",
           "medium",
           "large"]

# add the selections using np.select
# default = value for anything that falls outside conditions
penguins['size'] = np.select(conditions, choices, default=np.nan)

penguins.head()
code species island bill_length_mm bill_depth_mm sex body_mass_kg flipper_length_cm observer size
0 330 Adelie Torgersen 39.1 18.7 Male 3.75 18.1 C medium
1 128 Adelie Torgersen 39.5 17.4 Female 3.80 18.6 C medium
2 530 Adelie Torgersen 40.3 18.0 Female 3.25 19.5 B medium
3 796 Adelie Torgersen 38.3 NaN NaN NaN NaN A nan
4 692 Adelie Torgersen 36.7 19.3 Female 3.45 19.3 A medium

7.5.2.2 By selecting values

When we only want to update some values in a column we can do this by selecting this data using loc (if selecting by labels) or iloc (if selecting by position). The general sytanx for updating data with loc is:

# modifies data in place
df.loc[row_selection, col_name] = new_values

where

  • row_selection is the rows we want to update,
  • col_name is a single column name, and
  • new_values is the new value or values we want. If using multiple values, then new_values must be of the same length as the number of rows selected,

Example

Suppose we want to update the “Male” value in the sex column to “M”.

# select rows with sex=male and update the values in the sex column
penguins.loc[penguins.sex=='Male', 'sex'] = 'M'

# check changes
penguins.head()
code species island bill_length_mm bill_depth_mm sex body_mass_kg flipper_length_cm observer size
0 330 Adelie Torgersen 39.1 18.7 M 3.75 18.1 C medium
1 128 Adelie Torgersen 39.5 17.4 Female 3.80 18.6 C medium
2 530 Adelie Torgersen 40.3 18.0 Female 3.25 19.5 B medium
3 796 Adelie Torgersen 38.3 NaN NaN NaN NaN A nan
4 692 Adelie Torgersen 36.7 19.3 Female 3.45 19.3 A medium

7.6 SettingWithCopyWarning

Suppose we want to similarly update the “Female” values in the sex column to “F”. This is an example of another way we might try to do it, but it won’t update our data frame:

# select rows and columns with two selection brackets [][] (chained indexing), instead of loc[]
# then trying to update the values
penguins[penguins.sex=='Female']['sex'] = 'F'
/opt/python/3.7.13/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

When we select the data we want to update using chained indexing ([][]) instead of loc[] we get a SettingWithCopyWarning. Wit this warning, pandas is trying to alert us to a potential bug. In this case that we did not update our data frame:

# no values were updated
penguins.sex.unique()
array(['M', 'Female', nan], dtype=object)
Warning

The SettingWithCopyWarning is letting us know our code is ambiguous and we need to double check the code is doing what we intended. In the words of the pandas documentation:

assigning to the product of chained indexing has inherently unpredictable results.

To see what is going on we need to understand that some pandas operations return a view to your data, while others return a copy of your data.

  • Views are actual subsets of the original data, when we update them, we are modifying the original data frame.

  • Copies are unique objects, independent of our original data frames. When we update a copy we are not modifying the original data frame.

Depending on what we are trying to do we might want to modify the original data frame or we might want to modify a copy.

Check-in

Update the “Female” values in the penguins data frame to “F”. Don’t use chained indexing

# no chained indexing in assignment = no warning
penguins.loc[penguins.sex=='Female','sex'] = 'F'

# notice the values were updated now
penguins.head()
code species island bill_length_mm bill_depth_mm sex body_mass_kg flipper_length_cm observer size
0 330 Adelie Torgersen 39.1 18.7 M 3.75 18.1 C medium
1 128 Adelie Torgersen 39.5 17.4 F 3.80 18.6 C medium
2 530 Adelie Torgersen 40.3 18.0 F 3.25 19.5 B medium
3 796 Adelie Torgersen 38.3 NaN NaN NaN NaN A nan
4 692 Adelie Torgersen 36.7 19.3 F 3.45 19.3 A medium

7.6.0.1 Another SettingWithCopyWarning

Another common situation where SettingWithCopyWarning comes up is when we try to update a subset of a data frame we have already stored in a new variable.

Example

Suppose we only want to use data from Biscoe island and, after doing some analyses, we want to add a new column to it:

# select penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe']

# 50 lines of code here

# add a column, we get a warning
biscoe['sample_col'] = 100
/opt/python/3.7.13/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

It might not seem like it, but we have a chained assignment here too. We essentialy wrote:

penguins[penguins.island=='Biscoe']['sample_col'] = 100

To fix this we can take control of the copy-view situation and explicitely ask for a copy of the dataset when subsetting the data. Use the copy() method to do this:

# make sure you get a new data frame with penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe'].copy()

# add a column, no warning
biscoe['sample_col'] = 100

Now we are sure we did not modify our initial data, but rather the biscoe data frame:

# new column is there
biscoe.head()
code species island bill_length_mm bill_depth_mm sex body_mass_kg flipper_length_cm observer size sample_col
20 666 Adelie Biscoe 37.8 18.3 F 3.40 17.4 A medium 100
21 348 Adelie Biscoe 37.7 18.7 M 3.60 18.0 C medium 100
22 753 Adelie Biscoe 35.9 19.2 F 3.80 18.9 C medium 100
23 513 Adelie Biscoe 38.2 18.1 M 3.95 18.5 C medium 100
24 893 Adelie Biscoe 38.8 17.2 M 3.80 18.0 A medium 100
# and we are sure our original df wasn't modified
penguins.head()
code species island bill_length_mm bill_depth_mm sex body_mass_kg flipper_length_cm observer size
0 330 Adelie Torgersen 39.1 18.7 M 3.75 18.1 C medium
1 128 Adelie Torgersen 39.5 17.4 F 3.80 18.6 C medium
2 530 Adelie Torgersen 40.3 18.0 F 3.25 19.5 B medium
3 796 Adelie Torgersen 38.3 NaN NaN NaN NaN A nan
4 692 Adelie Torgersen 36.7 19.3 F 3.45 19.3 A medium

The SettingWithCopyWarning can often be tricky, there are also false positives and false negatives. Avoiding chained indexing and making a copy of your data frame subset whenever possible will save you from the usual pitfalls.

To learn more about the SettingWithCopyWarning, these are some articles that go into more depth:

📖 pandas Documentation - Returning a view versus a copy

📖 Real Python- SettingWithCopyWarning in pandas: Views vs Copies

📖 Dataquest - SettingwithCopyWarning: How to Fix This Warning in Pandas

7.7 References

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218.