9 Lab: Data cleaning

October 19, 2023.

Notebook Setup
  1. Create a new repository on GitHub. Use the following settings:

    1. Call the repository eds-220-sections. We’ll try reusing this same repository for all upcoming sections.

    2. Add a brief description for your new repository. For example: EDS 220 sections - practice sessions for environmental data analysis.

    3. Keep the repository public.

    4. Initialize the repository with a README file and a Python .gitignore template.

  2. In the Taylor server, start a new JupyterLab session or access an active one.

  3. Using the terminal, clone the repository to a new directory under your eds-220 directory.

  4. In the terminal, navigate into the eds-220-sections directory and verify eds-220-sections is your current working directory.

  5. Create a new Python Notebook in eds-220-sections.

  6. Update the notebook’s name to something useful like ‘exercise-hares-data.ipynb’.

  7. Use the terminal to stage, commit, and push this file to the remote repository. Remember:

    • stage: git add FILE_NAME
    • commit with message: git commit -m "COMMIT_MESSAGE"
    • push: git push

  8. If you are prompted for your credentials and need to set up a new Personal Access Token (PAT), follow steps 13-18 in this guide to set it up.

CHECK IN WITH YOUR TEAM

MAKE SURE YOU’VE ALL SUCCESSFULLY SET UP YOUR NOTEBOOKS BEFORE CONTINUING

General directions
  • Add comments in each one of your code cells
  • Include markdown cells in between your code cells to add titles/information to each exercise
  • Indications about when to commit and push changes are included, but you are encouraged to commit and push more often.
  • You won’t need to upload any data.
About the data

For this exercise we will use data about Snowshoe hares (Lepus americanus) in the Bonanza Creek Experimental Forest (Kielland et al., 2017).

This dataset is stored in the Environmental Data Initiative (EDI) data repository. This is a large data repository committed to making data Findable, Accessible, Interoperable, and Reusable (FAIR). It is the main repository for all the data associated with the Long Term Ecological Research Network (LTER).

9.1 Archive exploration

Take some time to look through the dataset’s description in EDI and click around. Discuss the following questions with your team:

  1. What is this data about?
  2. During what time frame were the observations in the dataset collected?
  3. Does the dataset contain sensitive data?
  4. Is there a publication associated with this dataset?

In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.

9.2 Adding an image

Follow these steps to add an image of a hare using a URL:

  1. Go to https://commons.wikimedia.org/wiki/File:SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head,washington_co,maine-01%2818988734889%29.jpg

  2. Get the URL of the hare image. To do this:

    • hover over the image → right click → “Copy Image Address”.

  3. At the end of the markdown cell with the dataset description, use markdown syntax to add the image from its URL: ![image description](URL-goes-here)

  4. Do you need to add an attribution in the image description? Check the license at the bottom of the Wikimedia page.

commit and push changes

9.3 Data loading

Back in your notebook, import the pandas package in a code cell. Then load the file 55_Hare_Data_2012.txt from its URL using pandas.read_csv() and store it in a variable named hares. Take a look at the head of the dataframe.
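As a rough illustration of the loading pattern, pandas.read_csv() accepts a file path, a URL string, or a file-like object. The sketch below uses a small in-memory stand-in with made-up column names and values (csv_text and toy are hypothetical, not the hares file), so you can see the call and the head() check without the real URL:

```python
import io

import pandas as pd

# Hypothetical stand-in for a remote text file; with a real dataset you would
# pass the URL string directly to pd.read_csv() instead of a StringIO object.
csv_text = "date,site,weight\n2012-06-01,a,1370\n2012-06-02,b,\n"

# Empty fields are read as missing values (NaN) by default
toy = pd.read_csv(io.StringIO(csv_text))

# Look at the first rows to confirm the file was parsed as expected
print(toy.head())
```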

commit and push changes

CHECK IN WITH YOUR TEAM

MAKE SURE YOU’VE ALL SUCCESSFULLY ACCESSED THE DATA BEFORE CONTINUING

9.4 Metadata exploration

Back in the EDI repository, click on View Full Metadata to access more information.

Go to the “Detailed Metadata” section and click on “Data Entities”. Take a minute to look at the descriptions for the dataset’s columns.

9.5 Detecting messy values

  1. Get the number of times each unique non-NA value in the sex column appears by running hares.sex.value_counts().

  2. Check the documentation of value_counts(). What is the purpose of the dropna=False parameter? Repeat step 1, this time adding the dropna=False parameter to value_counts().

  3. Discuss with your team the output of the unique value counts. Notice anything odd?

  4. You likely noticed there seem to be some repeated values, for example m appears twice. Use the unique() method on the sex column to see the unique values in this column (note that unique() also lists missing values). Discuss with your team what causes the seemingly repeated values.

  5. In the metadata section of the EDI repository, find the allowed values for the sex column. Discuss with your team whether these values correspond to the values present in the dataset.
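The inspection steps above can be tried first on a small example. This is a minimal sketch on a made-up Series (colors is hypothetical and unrelated to the hares data), showing how value_counts() behaves with and without dropna=False and how unique() lists every distinct entry:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with inconsistent entries and one missing value
colors = pd.Series(["red", "Red", "red ", "blue", np.nan, "red"])

# By default, missing values are left out of the counts
counts_default = colors.value_counts()

# With dropna=False, missing values get their own entry in the counts
counts_with_na = colors.value_counts(dropna=False)

print(counts_default)
print(counts_with_na)

# unique() returns the distinct values, including the missing one;
# the quotes in the printed output make small differences between entries visible
print(colors.unique())
```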

commit and push changes

9.6 Clean values

  1. Use np.select like we did on Monday to create a new column called sex_simple, where:

    • ‘F’, ‘f’, and the seemingly repeated ‘f’ value from section 9.5 get assigned to ‘female’,
    • ‘M’, ‘m’, and the seemingly repeated ‘m’ value get assigned to ‘male’, and
    • anything else gets assigned np.nan.

    HINTS:

    • You need to create a list with two conditions and a list with two choices.
    • To write the first condition, think about what (hares.sex=='F') | (hares.sex=='f') means. Do you need to add anything else?

  2. Check the counts of unique values (including NAs) in the new sex_simple column.
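To recall the general shape of np.select, here is a minimal sketch on a made-up dataframe (df, its size column, and the "unmatched" placeholder are all hypothetical). One assumption to flag: combining string choices with default=np.nan may behave differently depending on your NumPy version, so this sketch uses a string placeholder as the default and converts it to a missing value afterwards. It is one possible approach, not necessarily the one used in class:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the hares dataset
df = pd.DataFrame({"size": ["S", "s", "L", "l", "?", None]})

# One boolean condition per output label, in the same order as `choices`
conditions = [
    (df["size"] == "S") | (df["size"] == "s"),
    (df["size"] == "L") | (df["size"] == "l"),
]
choices = ["small", "large"]

# Rows matching no condition receive the placeholder default
labels = pd.Series(
    np.select(conditions, choices, default="unmatched"),
    index=df.index,
)

# Keep matched labels; where() turns the placeholder rows into missing values
df["size_simple"] = labels.where(labels != "unmatched")

# Count the labels, including the missing ones
print(df["size_simple"].value_counts(dropna=False))
```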

commit and push changes

9.7 Calculate mean weight

  1. Use groupby() to calculate the mean weight by sex (use the new column).
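As a reminder of the groupby pattern, the sketch below computes a group mean on made-up data (toy, group, and value are hypothetical names). It also shows two defaults worth keeping in mind: rows whose group key is missing are dropped from the result, and missing values are skipped when computing the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the hares dataset
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", np.nan],
    "value": [1.0, 3.0, 10.0, np.nan, 100.0],
})

# Select the column to summarize, then aggregate within each group
means = toy.groupby("group")["value"].mean()

print(means)
```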

commit and push changes

10 References

Kielland, K., F.S. Chapin, R.W. Ruess, and Bonanza Creek LTER. 2017. Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present ver 22. Environmental Data Initiative. https://doi.org/10.6073/pasta/03dce4856d79b91557d8e6ce2cbcdc14 (Accessed 2023-10-18).