7 Updating data frames

In this lesson we will introduce methods for updating a pandas data frame, these include adding and removing columns and updating specific values.

7.1 Data: Palmer penguins

We will use the Palmer penguins dataset (Horst et al., 2020). This time we will import it via the seaborn package since it is included as one of seaborn’s example datasets.

seaborn

seaborn is a library to make nice looking graphs in Python. We’ve been using plot() to make basic exploratory graphs. seaborn is the next step if you’re interested in making publication-level statistical graphics.

# standard libraries
import pandas as pd
import numpy as np

# importing seaborn with its standard abbreviation
import seaborn as sns

# will use the random library to create some random numbers
import random

penguins = sns.load_dataset("penguins")

# look at dataframe's head
penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

7.2 Adding a single column

The general syntax to adding a new column to a pd.DataFrame is

df['new_col_name'] = new_column_values

where new_column values could be:

a pd.Series or numpy.array of the same length as the data frame
a single scalar

Example

Suppose we want to create a new column where the body mass is in kilograms instead of grams, so we need to divide the body_mass_g by 1000.

# add a new column body_mass_kg 
# sane syntax as adding a new key to a dictionary
penguins['body_mass_kg'] = penguins.body_mass_g/1000

# confirm the new column is in the data frame
print('body_mass_kg' in penguins.columns)

# take a look at the new column
penguins.head()

True

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	body_mass_kg
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male	3.75
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female	3.80
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female	3.25
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female	3.45

pd.DataFrames and dictionaries

Remember we talked about how a pandas.DataFrame could be seen as a dictionary of its columns. The most basic way of adding a new column to a data frame is the same way of adding a new key-value pair to a dictionary:

# adding a new key-value pair to a dictionary
# same as adding a new column in pandas
dict[new_key] = new_value

Notice the new column got added at the end. If we want to create a new column and insert it at a particular position we use the data frame method insert():

df.insert(loc = integer_index,  # location of new column
          column = 'new_col_name', 
          value = new_col_values)

Example

Suppose each penguin observation gets a unique identifier as a three digit number. We want to add this column, at the beginning of the data frame. We can use insert to do this:

# create random 3-digit codes
# random.sample used for random sampling wo replacement
codes = random.sample(range(100,1000), len(penguins))

# insert codes at the front of data frame = index 0
penguins.insert(loc=0, 
                column = 'code',
                value = codes)
        
penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	body_mass_kg
0	330	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male	3.75
1	128	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female	3.80
2	530	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female	3.25
3	796	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	NaN
4	692	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female	3.45

7.3 Adding multiple columns

We can assign multiple columns in the same call by using the data frame’s assign() method. The general syntax is:

df = df.assign( new_col1_name = new_col1_values, 
           new_col2_name = new_col2_values)

Notice the new column names are not strings, we declare them as if we were creating variables.

Example

Suppose we want to add these new columns:

flipper length converted from mm to cm, and
a code representing the observer.

We can add these columns to penguins using assign():

# create new columns in the data frame
# random.choices used for random sampling with replacement
# need to reassign output of assign() to update the data frame
penguins = penguins.assign( flipper_length_cm = penguins.flipper_length_mm /10, 
                            observer =   random.choices(['A','B','C'], k=len(penguins)))
# look at result
penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	body_mass_kg	flipper_length_cm	observer
0	330	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male	3.75	18.1	C
1	128	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female	3.80	18.6	C
2	530	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female	3.25	19.5	B
3	796	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A
4	692	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female	3.45	19.3	A

7.4 Removing columns

We can remove columns using the drop() method for data frames, the syntax is:

df = df.drop(columns = col_names)

where col_names can be a single column name (str) or a list of column names. The drop() method does not modify the data frame in place, so you need to reassign the output.

Example

Now that we updated the units for flipper length and body mass, it makes sense to remove the previous columns to avoid duplicate information. We can do this using drop():

# use a list of column names
# reassign output of drop() to dataframe to update it
penguins = penguins.drop(columns=['flipper_length_mm','body_mass_g'])

# check columns
print(penguins.columns)

Index(['code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'sex',
       'body_mass_kg', 'flipper_length_cm', 'observer'],
      dtype='object')

7.5 Updating values

Sometimes we want to update certain value within our data frame. We’ll review some methods and best practices to do that in this section.

7.5.1 A single value

We can access a single value in a pd.DataFrame using the locators

at[] to select by labels, or
iat[] to select by position.

The syntax for at[] is:

df.at[single_index_value, 'column_name']

Think of at[] as the equivalent to loc[] when trying to access a single value.

Example

Let’s say we want to know what was the bill length of the penguin in the fourth row. We can access that using at[]:

# access value at row with index=3 and column='bill_length_mm'
penguins.at[3,'bill_length_mm']

nan

We get this is an NA. Maybe we want to update it to 38.3 mm. We can do this with at[] too:

# update NA to 38.3
penguins.at[3,'bill_length_mm'] = 38.3

# check it was updated
penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	sex	body_mass_kg	flipper_length_cm	observer
0	330	Adelie	Torgersen	39.1	18.7	Male	3.75	18.1	C
1	128	Adelie	Torgersen	39.5	17.4	Female	3.80	18.6	C
2	530	Adelie	Torgersen	40.3	18.0	Female	3.25	19.5	B
3	796	Adelie	Torgersen	38.3	NaN	NaN	NaN	NaN	A
4	692	Adelie	Torgersen	36.7	19.3	Female	3.45	19.3	A

If we want to access or update a single value by position we use the iat[] locator.

7.5.2 Multiple values in a column

What if we want to update multiple values in a column? We’ll cover two cases: with a condition on the column values and by selecting a few values to update.

7.5.2.1 By condition

Often we want to create a new column where the new values depend on conditions on another column’s values.

Example

Suppose we want to classify all penguins with body mass less than 3kg as small, penguins with body mass greater or equal than 3kg but less than 5kg as medium, and those with body mass greater or equal than 5kg as large. One way to add this information in a new column using numpy.select():

# create a list with the conditions
conditions = [penguins.body_mass_kg < 3, 
              (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
              5 <= penguins.body_mass_kg]

# create a list with the choices
choices = ["small",
           "medium",
           "large"]

# add the selections using np.select
# default = value for anything that falls outside conditions
penguins['size'] = np.select(conditions, choices, default=np.nan)

penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	sex	body_mass_kg	flipper_length_cm	observer	size
0	330	Adelie	Torgersen	39.1	18.7	Male	3.75	18.1	C	medium
1	128	Adelie	Torgersen	39.5	17.4	Female	3.80	18.6	C	medium
2	530	Adelie	Torgersen	40.3	18.0	Female	3.25	19.5	B	medium
3	796	Adelie	Torgersen	38.3	NaN	NaN	NaN	NaN	A	nan
4	692	Adelie	Torgersen	36.7	19.3	Female	3.45	19.3	A	medium

7.5.2.2 By selecting values

When we only want to update some values in a column we can do this by selecting this data using loc (if selecting by labels) or iloc (if selecting by position). The general sytanx for updating data with loc is:

# modifies data in place
df.loc[row_selection, col_name] = new_values

where

row_selection is the rows we want to update,
col_name is a single column name, and
new_values is the new value or values we want. If using multiple values, then new_values must be of the same length as the number of rows selected,

Example

Suppose we want to update the “Male” value in the sex column to “M”.

# select rows with sex=male and update the values in the sex column
penguins.loc[penguins.sex=='Male', 'sex'] = 'M'

# check changes
penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	sex	body_mass_kg	flipper_length_cm	observer	size
0	330	Adelie	Torgersen	39.1	18.7	M	3.75	18.1	C	medium
1	128	Adelie	Torgersen	39.5	17.4	Female	3.80	18.6	C	medium
2	530	Adelie	Torgersen	40.3	18.0	Female	3.25	19.5	B	medium
3	796	Adelie	Torgersen	38.3	NaN	NaN	NaN	NaN	A	nan
4	692	Adelie	Torgersen	36.7	19.3	Female	3.45	19.3	A	medium

7.6 `SettingWithCopyWarning`

Suppose we want to similarly update the “Female” values in the sex column to “F”. This is an example of another way we might try to do it, but it won’t update our data frame:

# select rows and columns with two selection brackets [][] (chained indexing), instead of loc[]
# then trying to update the values
penguins[penguins.sex=='Female']['sex'] = 'F'

/opt/python/3.7.13/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

When we select the data we want to update using chained indexing ([][]) instead of loc[] we get a SettingWithCopyWarning. Wit this warning, pandas is trying to alert us to a potential bug. In this case that we did not update our data frame:

# no values were updated
penguins.sex.unique()

array(['M', 'Female', nan], dtype=object)

Warning

The SettingWithCopyWarning is letting us know our code is ambiguous and we need to double check the code is doing what we intended. In the words of the pandas documentation:

assigning to the product of chained indexing has inherently unpredictable results.

To see what is going on we need to understand that some pandas operations return a view to your data, while others return a copy of your data.

Views are actual subsets of the original data, when we update them, we are modifying the original data frame.
Copies are unique objects, independent of our original data frames. When we update a copy we are not modifying the original data frame.

Depending on what we are trying to do we might want to modify the original data frame or we might want to modify a copy.

Check-in

Update the “Female” values in the penguins data frame to “F”. Don’t use chained indexing

# no chained indexing in assignment = no warning
penguins.loc[penguins.sex=='Female','sex'] = 'F'

# notice the values were updated now
penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	sex	body_mass_kg	flipper_length_cm	observer	size
0	330	Adelie	Torgersen	39.1	18.7	M	3.75	18.1	C	medium
1	128	Adelie	Torgersen	39.5	17.4	F	3.80	18.6	C	medium
2	530	Adelie	Torgersen	40.3	18.0	F	3.25	19.5	B	medium
3	796	Adelie	Torgersen	38.3	NaN	NaN	NaN	NaN	A	nan
4	692	Adelie	Torgersen	36.7	19.3	F	3.45	19.3	A	medium

7.6.0.1 Another `SettingWithCopyWarning`

Another common situation where SettingWithCopyWarning comes up is when we try to update a subset of a data frame we have already stored in a new variable.

Example

Suppose we only want to use data from Biscoe island and, after doing some analyses, we want to add a new column to it:

# select penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe']

# 50 lines of code here

# add a column, we get a warning
biscoe['sample_col'] = 100

/opt/python/3.7.13/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

It might not seem like it, but we have a chained assignment here too. We essentialy wrote:

penguins[penguins.island=='Biscoe']['sample_col'] = 100

To fix this we can take control of the copy-view situation and explicitely ask for a copy of the dataset when subsetting the data. Use the copy() method to do this:

# make sure you get a new data frame with penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe'].copy()

# add a column, no warning
biscoe['sample_col'] = 100

Now we are sure we did not modify our initial data, but rather the biscoe data frame:

# new column is there
biscoe.head()

	code	species	island	bill_length_mm	bill_depth_mm	sex	body_mass_kg	flipper_length_cm	observer	size	sample_col
20	666	Adelie	Biscoe	37.8	18.3	F	3.40	17.4	A	medium	100
21	348	Adelie	Biscoe	37.7	18.7	M	3.60	18.0	C	medium	100
22	753	Adelie	Biscoe	35.9	19.2	F	3.80	18.9	C	medium	100
23	513	Adelie	Biscoe	38.2	18.1	M	3.95	18.5	C	medium	100
24	893	Adelie	Biscoe	38.8	17.2	M	3.80	18.0	A	medium	100

# and we are sure our original df wasn't modified
penguins.head()

	code	species	island	bill_length_mm	bill_depth_mm	sex	body_mass_kg	flipper_length_cm	observer	size
0	330	Adelie	Torgersen	39.1	18.7	M	3.75	18.1	C	medium
1	128	Adelie	Torgersen	39.5	17.4	F	3.80	18.6	C	medium
2	530	Adelie	Torgersen	40.3	18.0	F	3.25	19.5	B	medium
3	796	Adelie	Torgersen	38.3	NaN	NaN	NaN	NaN	A	nan
4	692	Adelie	Torgersen	36.7	19.3	F	3.45	19.3	A	medium

The SettingWithCopyWarning can often be tricky, there are also false positives and false negatives. Avoiding chained indexing and making a copy of your data frame subset whenever possible will save you from the usual pitfalls.

To learn more about the SettingWithCopyWarning, these are some articles that go into more depth:

📖 pandas Documentation - Returning a view versus a copy

📖 Real Python- SettingWithCopyWarning in pandas: Views vs Copies

📖 Dataquest - SettingwithCopyWarning: How to Fix This Warning in Pandas

7.7 References

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218.

7.1 Data: Palmer penguins

7.2 Adding a single column

7.3 Adding multiple columns

7.4 Removing columns

7.5 Updating values

7.5.1 A single value

7.5.2 Multiple values in a column

7.5.2.1 By condition

7.5.2.2 By selecting values

7.6 SettingWithCopyWarning

7.6.0.1 Another SettingWithCopyWarning

7.7 References

7.6 `SettingWithCopyWarning`

7.6.0.1 Another `SettingWithCopyWarning`