In this lesson we will introduce methods for updating a pandas data frame, these include adding and removing columns and updating specific values.
7.1 Data: Palmer penguins
We will use the Palmer penguins dataset (Horst et al., 2020). This time we will import it via the seaborn package since it is included as one of seaborn’s example datasets.
seaborn
seaborn is a library to make nice looking graphs in Python. We’ve been using plot() to make basic exploratory graphs. seaborn is the next step if you’re interested in making publication-level statistical graphics.
# standard librariesimport pandas as pdimport numpy as np# importing seaborn with its standard abbreviationimport seaborn as sns# will use the random library to create some random numbersimport randompenguins = sns.load_dataset("penguins")# look at dataframe's headpenguins.head()
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
0
Adelie
Torgersen
39.1
18.7
181.0
3750.0
Male
1
Adelie
Torgersen
39.5
17.4
186.0
3800.0
Female
2
Adelie
Torgersen
40.3
18.0
195.0
3250.0
Female
3
Adelie
Torgersen
NaN
NaN
NaN
NaN
NaN
4
Adelie
Torgersen
36.7
19.3
193.0
3450.0
Female
7.2 Adding a single column
The general syntax to adding a new column to a pd.DataFrame is
df['new_col_name'] = new_column_values
where new_column values could be:
a pd.Series or numpy.array of the same length as the data frame
a single scalar
Example
Suppose we want to create a new column where the body mass is in kilograms instead of grams, so we need to divide the body_mass_g by 1000.
# add a new column body_mass_kg # sane syntax as adding a new key to a dictionarypenguins['body_mass_kg'] = penguins.body_mass_g/1000# confirm the new column is in the data frameprint('body_mass_kg'in penguins.columns)# take a look at the new columnpenguins.head()
True
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
body_mass_kg
0
Adelie
Torgersen
39.1
18.7
181.0
3750.0
Male
3.75
1
Adelie
Torgersen
39.5
17.4
186.0
3800.0
Female
3.80
2
Adelie
Torgersen
40.3
18.0
195.0
3250.0
Female
3.25
3
Adelie
Torgersen
NaN
NaN
NaN
NaN
NaN
NaN
4
Adelie
Torgersen
36.7
19.3
193.0
3450.0
Female
3.45
pd.DataFrames and dictionaries
Remember we talked about how a pandas.DataFrame could be seen as a dictionary of its columns. The most basic way of adding a new column to a data frame is the same way of adding a new key-value pair to a dictionary:
# adding a new key-value pair to a dictionary# same as adding a new column in pandasdict[new_key] = new_value
Notice the new column got added at the end. If we want to create a new column and insert it at a particular position we use the data frame method insert():
df.insert(loc = integer_index, # location of new column column ='new_col_name', value = new_col_values)
Example
Suppose each penguin observation gets a unique identifier as a three digit number. We want to add this column, at the beginning of the data frame. We can use insert to do this:
# create random 3-digit codes# random.sample used for random sampling wo replacementcodes = random.sample(range(100,1000), len(penguins))# insert codes at the front of data frame = index 0penguins.insert(loc=0, column ='code', value = codes)penguins.head()
code
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
body_mass_kg
0
330
Adelie
Torgersen
39.1
18.7
181.0
3750.0
Male
3.75
1
128
Adelie
Torgersen
39.5
17.4
186.0
3800.0
Female
3.80
2
530
Adelie
Torgersen
40.3
18.0
195.0
3250.0
Female
3.25
3
796
Adelie
Torgersen
NaN
NaN
NaN
NaN
NaN
NaN
4
692
Adelie
Torgersen
36.7
19.3
193.0
3450.0
Female
3.45
7.3 Adding multiple columns
We can assign multiple columns in the same call by using the data frame’s assign() method. The general syntax is:
Notice the new column names are not strings, we declare them as if we were creating variables.
Example
Suppose we want to add these new columns:
flipper length converted from mm to cm, and
a code representing the observer.
We can add these columns to penguins using assign():
# create new columns in the data frame# random.choices used for random sampling with replacement# need to reassign output of assign() to update the data framepenguins = penguins.assign( flipper_length_cm = penguins.flipper_length_mm /10, observer = random.choices(['A','B','C'], k=len(penguins)))# look at resultpenguins.head()
where col_names can be a single column name (str) or a list of column names. The drop() method does not modify the data frame in place, so you need to reassign the output.
Example
Now that we updated the units for flipper length and body mass, it makes sense to remove the previous columns to avoid duplicate information. We can do this using drop():
# use a list of column names# reassign output of drop() to dataframe to update itpenguins = penguins.drop(columns=['flipper_length_mm','body_mass_g'])# check columnsprint(penguins.columns)
Think of at[] as the equivalent to loc[] when trying to access a single value.
Example
Let’s say we want to know what was the bill length of the penguin in the fourth row. We can access that using at[]:
# access value at row with index=3 and column='bill_length_mm'penguins.at[3,'bill_length_mm']
nan
We get this is an NA. Maybe we want to update it to 38.3 mm. We can do this with at[] too:
# update NA to 38.3penguins.at[3,'bill_length_mm'] =38.3# check it was updatedpenguins.head()
code
species
island
bill_length_mm
bill_depth_mm
sex
body_mass_kg
flipper_length_cm
observer
0
330
Adelie
Torgersen
39.1
18.7
Male
3.75
18.1
C
1
128
Adelie
Torgersen
39.5
17.4
Female
3.80
18.6
C
2
530
Adelie
Torgersen
40.3
18.0
Female
3.25
19.5
B
3
796
Adelie
Torgersen
38.3
NaN
NaN
NaN
NaN
A
4
692
Adelie
Torgersen
36.7
19.3
Female
3.45
19.3
A
If we want to access or update a single value by position we use the iat[] locator.
7.5.2 Multiple values in a column
What if we want to update multiple values in a column? We’ll cover two cases: with a condition on the column values and by selecting a few values to update.
7.5.2.1 By condition
Often we want to create a new column where the new values depend on conditions on another column’s values.
Example
Suppose we want to classify all penguins with body mass less than 3kg as small, penguins with body mass greater or equal than 3kg but less than 5kg as medium, and those with body mass greater or equal than 5kg as large. One way to add this information in a new column using numpy.select():
# create a list with the conditionsconditions = [penguins.body_mass_kg <3, (3<= penguins.body_mass_kg) & (penguins.body_mass_kg <5),5<= penguins.body_mass_kg]# create a list with the choiceschoices = ["small","medium","large"]# add the selections using np.select# default = value for anything that falls outside conditionspenguins['size'] = np.select(conditions, choices, default=np.nan)penguins.head()
code
species
island
bill_length_mm
bill_depth_mm
sex
body_mass_kg
flipper_length_cm
observer
size
0
330
Adelie
Torgersen
39.1
18.7
Male
3.75
18.1
C
medium
1
128
Adelie
Torgersen
39.5
17.4
Female
3.80
18.6
C
medium
2
530
Adelie
Torgersen
40.3
18.0
Female
3.25
19.5
B
medium
3
796
Adelie
Torgersen
38.3
NaN
NaN
NaN
NaN
A
nan
4
692
Adelie
Torgersen
36.7
19.3
Female
3.45
19.3
A
medium
7.5.2.2 By selecting values
When we only want to update some values in a column we can do this by selecting this data using loc (if selecting by labels) or iloc (if selecting by position). The general sytanx for updating data with loc is:
# modifies data in placedf.loc[row_selection, col_name] = new_values
where
row_selection is the rows we want to update,
col_name is a single column name, and
new_values is the new value or values we want. If using multiple values, then new_values must be of the same length as the number of rows selected,
Example
Suppose we want to update the “Male” value in the sex column to “M”.
# select rows with sex=male and update the values in the sex columnpenguins.loc[penguins.sex=='Male', 'sex'] ='M'# check changespenguins.head()
code
species
island
bill_length_mm
bill_depth_mm
sex
body_mass_kg
flipper_length_cm
observer
size
0
330
Adelie
Torgersen
39.1
18.7
M
3.75
18.1
C
medium
1
128
Adelie
Torgersen
39.5
17.4
Female
3.80
18.6
C
medium
2
530
Adelie
Torgersen
40.3
18.0
Female
3.25
19.5
B
medium
3
796
Adelie
Torgersen
38.3
NaN
NaN
NaN
NaN
A
nan
4
692
Adelie
Torgersen
36.7
19.3
Female
3.45
19.3
A
medium
7.6SettingWithCopyWarning
Suppose we want to similarly update the “Female” values in the sex column to “F”. This is an example of another way we might try to do it, but it won’t update our data frame:
# select rows and columns with two selection brackets [][] (chained indexing), instead of loc[]# then trying to update the valuespenguins[penguins.sex=='Female']['sex'] ='F'
/opt/python/3.7.13/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
When we select the data we want to update using chained indexing ([][]) instead of loc[] we get a SettingWithCopyWarning. Wit this warning, pandas is trying to alert us to a potential bug. In this case that we did not update our data frame:
# no values were updatedpenguins.sex.unique()
array(['M', 'Female', nan], dtype=object)
Warning
The SettingWithCopyWarning is letting us know our code is ambiguous and we need to double check the code is doing what we intended. In the words of the pandas documentation:
assigning to the product of chained indexing has inherently unpredictable results.
To see what is going on we need to understand that some pandas operations return a view to your data, while others return a copy of your data.
Views are actual subsets of the original data, when we update them, we are modifying the original data frame.
Copies are unique objects, independent of our original data frames. When we update a copy we are not modifying the original data frame.
Depending on what we are trying to do we might want to modify the original data frame or we might want to modify a copy.
Check-in
Update the “Female” values in the penguins data frame to “F”. Don’t use chained indexing
# no chained indexing in assignment = no warningpenguins.loc[penguins.sex=='Female','sex'] ='F'# notice the values were updated nowpenguins.head()
code
species
island
bill_length_mm
bill_depth_mm
sex
body_mass_kg
flipper_length_cm
observer
size
0
330
Adelie
Torgersen
39.1
18.7
M
3.75
18.1
C
medium
1
128
Adelie
Torgersen
39.5
17.4
F
3.80
18.6
C
medium
2
530
Adelie
Torgersen
40.3
18.0
F
3.25
19.5
B
medium
3
796
Adelie
Torgersen
38.3
NaN
NaN
NaN
NaN
A
nan
4
692
Adelie
Torgersen
36.7
19.3
F
3.45
19.3
A
medium
7.6.0.1 Another SettingWithCopyWarning
Another common situation where SettingWithCopyWarning comes up is when we try to update a subset of a data frame we have already stored in a new variable.
Example
Suppose we only want to use data from Biscoe island and, after doing some analyses, we want to add a new column to it:
# select penguins from Biscoe islandbiscoe = penguins[penguins.island=='Biscoe']# 50 lines of code here# add a column, we get a warningbiscoe['sample_col'] =100
/opt/python/3.7.13/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
It might not seem like it, but we have a chained assignment here too. We essentialy wrote:
To fix this we can take control of the copy-view situation and explicitely ask for a copy of the dataset when subsetting the data. Use the copy() method to do this:
# make sure you get a new data frame with penguins from Biscoe islandbiscoe = penguins[penguins.island=='Biscoe'].copy()# add a column, no warningbiscoe['sample_col'] =100
Now we are sure we did not modify our initial data, but rather the biscoe data frame:
# new column is therebiscoe.head()
code
species
island
bill_length_mm
bill_depth_mm
sex
body_mass_kg
flipper_length_cm
observer
size
sample_col
20
666
Adelie
Biscoe
37.8
18.3
F
3.40
17.4
A
medium
100
21
348
Adelie
Biscoe
37.7
18.7
M
3.60
18.0
C
medium
100
22
753
Adelie
Biscoe
35.9
19.2
F
3.80
18.9
C
medium
100
23
513
Adelie
Biscoe
38.2
18.1
M
3.95
18.5
C
medium
100
24
893
Adelie
Biscoe
38.8
17.2
M
3.80
18.0
A
medium
100
# and we are sure our original df wasn't modifiedpenguins.head()
code
species
island
bill_length_mm
bill_depth_mm
sex
body_mass_kg
flipper_length_cm
observer
size
0
330
Adelie
Torgersen
39.1
18.7
M
3.75
18.1
C
medium
1
128
Adelie
Torgersen
39.5
17.4
F
3.80
18.6
C
medium
2
530
Adelie
Torgersen
40.3
18.0
F
3.25
19.5
B
medium
3
796
Adelie
Torgersen
38.3
NaN
NaN
NaN
NaN
A
nan
4
692
Adelie
Torgersen
36.7
19.3
F
3.45
19.3
A
medium
The SettingWithCopyWarning can often be tricky, there are also false positives and false negatives. Avoiding chained indexing and making a copy of your data frame subset whenever possible will save you from the usual pitfalls.
To learn more about the SettingWithCopyWarning, these are some articles that go into more depth: