9 * Lab: Data cleaning
October 19, 2023.
Create a new repository on GitHub. Use the following settings:
- Call the repository eds-220-sections. We’ll try reusing this same repository for all upcoming sections.
- Add a brief description for your new repository. For example: EDS 220 sections - practice sessions for environmental data analysis.
- Keep the repository public.
- Initialize the repository with a README file and a Python .gitignore template.
In the Taylor server, start a new JupyterLab session or access an active one.
Using the terminal, clone the repository to a new directory under your eds-220 directory.
In the terminal, navigate into the eds-220-sections directory and verify eds-220-sections is your current working directory.
Create a new Python Notebook in eds-220-sections.
Update the notebook’s name to something useful like ‘exercise-hares-data.ipynb’.
Use the terminal to stage, commit, and push this file to the remote repository. Remember:
- stage: git add FILE_NAME
- commit with message: git commit -m "COMMIT_MESSAGE"
- push: git push
If you are prompted for your credentials and need to set up a new Personal Access Token (PAT), follow steps 13-18 in this guide.
CHECK IN WITH YOUR TEAM
MAKE SURE YOU’VE ALL SUCCESSFULLY SET UP YOUR NOTEBOOKS BEFORE CONTINUING
- Add comments in each one of your code cells
- Include markdown cells in between your code cells to add titles/information to each exercise
- Indications about when to commit and push changes are included, but you are encouraged to commit and push more often.
- You won’t need to upload any data.
For this exercise we will use data about Snowshoe hares (Lepus americanus) in the Bonanza Creek Experimental Forest (Kielland et al., 2017).
This dataset is stored in the Environmental Data Initiative (EDI) data repository. This is a huge data repository committed to making data Findable, Accessible, Interoperable, and Reusable (FAIR). It is the main repository for all the data associated with the Long Term Ecological Research Network (LTER).
9.1 Archive exploration
Take some time to look through the dataset’s description in EDI and click around. Discuss the following questions with your team:
- What is this data about?
- During what time frame were the observations in the dataset collected?
- Does the dataset contain sensitive data?
- Is there a publication associated with this dataset?
In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.
9.2 Adding an image
Follow these steps to add an image of a hare using a URL:
Get the URL of the hare image. To do this:
- hover over the image –> right click –> “Copy Image Address”.
At the end of the markdown cell with the dataset description, use markdown syntax to add the image from its URL:
![image description](URL-goes-here)
Do you need to add an attribution in the image description? Check the license at the bottom of the Wikimedia page.
commit and push changes
9.3 Data loading
Back in your notebook, import the pandas package in a code cell and import the 55_Hare_Data_2012.txt file from its URL using pandas.read_csv(). Store it in a variable named hares. Take a look at the head of the dataframe.
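As a minimal sketch of the loading pattern, the example below reads a small in-memory stand-in instead of the real file. The rows, the column names other than sex, and the variable name hares_sample are made up for illustration. In the exercise you would pass the file's URL (as a string) to pandas.read_csv() and store the result in hares.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few made-up rows.
# In the exercise, pass the file's URL (a string) to pd.read_csv() instead.
sample_csv = io.StringIO(
    "date,sex,weight\n"
    "2012-01-01,f,1370\n"
    "2012-01-02,M,1430\n"
    "2012-01-03,,1200\n"
)

# read_csv accepts a URL, a local path, or a file-like object.
hares_sample = pd.read_csv(sample_csv)

# head() shows the first rows (five by default) for a quick sanity check.
print(hares_sample.head())
```

Empty fields, like the sex value in the last made-up row, are read in as missing values (NaN).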
commit and push changes
CHECK IN WITH YOUR TEAM
MAKE SURE YOU’VE ALL SUCCESSFULLY ACCESSED THE DATA BEFORE CONTINUING
9.4 Metadata exploration
Back in the EDI repository, click on View Full Metadata to access more information.
Go to the “Detailed Metadata” section and click on “Data Entities”. Take a minute to look at the descriptions for the dataset’s columns.
9.5 Detecting messy values
Get the number of times each unique non-NA value in the sex column appears by running hares.sex.value_counts().
Check the documentation of value_counts(). What is the purpose of the dropna=False parameter? Run the count again, this time adding the dropna=False parameter to value_counts().
Discuss with your team the output of the unique value counts. Notice anything odd?
You likely noticed there seem to be some repeated values, for example m appears twice. Use the unique() method on the sex column to see the unique values in this column. Discuss with your team what causes the seemingly repeated values.
In the metadata section of the EDI repository, find the allowed values for the sex column. Discuss with your team whether these values correspond to the values present in the dataset.
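The behavior of these methods can be sketched on a small made-up Series (the values below are illustrative, not taken from the hares data). It shows how dropna=False changes the counts and how unique() exposes strings that look identical when printed in a table.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; note that "m " has a trailing space.
sex = pd.Series(["m", "m ", "f", "F", np.nan, "m"])

# By default, value_counts() leaves missing values out of the tally.
counts = sex.value_counts()

# With dropna=False, missing values get their own row in the result.
counts_with_na = sex.value_counts(dropna=False)

print(counts)
print(counts_with_na)

# unique() returns the raw values as an array, so "m" and "m " appear as
# distinct strings. It also includes the missing value.
print(sex.unique())
```

Printing the array from unique() shows the quotes around each string, which makes stray whitespace much easier to spot than in the value_counts() table.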
commit and push changes
9.6 Clean values
- Use np.select like we did on Monday to create a new column called sex_simple, where:
- ‘F’, ‘f’, and ‘f ’ (with a trailing space) get assigned to ‘female’,
- ‘M’, ‘m’, and ‘m ’ (with a trailing space) get assigned to ‘male’, and
- anything else gets assigned np.nan
HINTS:
- You need to create a list with two conditions and a list with two choices.
- To write the conditions, think about what (hares.sex=='F') | (hares.sex=='f') means. Do you need to add anything else?
- Check the counts of unique values (including NAs) in the new sex_simple column.
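One way the two-conditions, two-choices pattern might look is sketched below on a made-up dataframe named df (the values are illustrative only). One deliberate deviation is labeled here as an assumption: the sketch passes default=None instead of np.nan. Depending on the NumPy version, mixing string choices with a float default may turn the missing value into the text 'nan' or raise a type error, whereas pandas treats None as a missing value.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; "f " and "m " carry a trailing space.
df = pd.DataFrame({"sex": ["F", "f", "f ", "M", "m ", "?", None]})

# One boolean condition per output category, combined with | (element-wise or).
conditions = [
    (df.sex == "F") | (df.sex == "f") | (df.sex == "f "),
    (df.sex == "M") | (df.sex == "m") | (df.sex == "m "),
]

# choices[i] is assigned wherever conditions[i] is True.
choices = ["female", "male"]

# Assumption: default=None (not np.nan) so the result stays a column of
# strings plus proper missing values across NumPy versions.
df["sex_simple"] = np.select(conditions, choices, default=None)

# Count each category, including the missing values.
print(df["sex_simple"].value_counts(dropna=False))
```

Rows that match neither condition, including rows where sex is already missing, end up as missing values in sex_simple.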
commit and push changes
9.7 Calculate mean weight
- Use groupby() to calculate the mean weight by sex (use the new column).
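A minimal sketch of the split-and-average pattern, using a made-up dataframe: the column name weight and all values are assumptions for illustration, so check the metadata for the actual column name and units.

```python
import pandas as pd

# Made-up values for illustration; "weight" is an assumed column name.
df = pd.DataFrame(
    {
        "sex_simple": ["female", "female", "male", None],
        "weight": [1200.0, 1400.0, 1500.0, 1000.0],
    }
)

# Group rows by category, select the numeric column, then average each group.
# Rows whose group key is missing are left out by default.
mean_weight = df.groupby("sex_simple")["weight"].mean()

print(mean_weight)
```

Selecting the column before calling mean() keeps the result to a single Series indexed by category, rather than averaging every numeric column in the dataframe.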
commit and push changes
10 References
Kielland, K., F.S. Chapin, R.W. Ruess, and Bonanza Creek LTER. 2017. Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present ver 22. Environmental Data Initiative. https://doi.org/10.6073/pasta/03dce4856d79b91557d8e6ce2cbcdc14 (Accessed 2023-10-18).