4 Lab: Preliminary data exploration
October 5, 2023.
This discussion section will guide you through preliminary data exploration for a real-world dataset about animal observations in the California drylands. Our goals are to:
- Keep practicing setting up a GitHub repository and using `git commit` and `git push`
- Collaborate with your new team!
- Practice getting preliminary information about a dataset from its archive
- Introduce `pd.read_csv()` for loading files directly from a URL
- Introduce preliminary data exploration strategies in `pandas`
Create a new repository on GitHub. Use the following settings:

- Call the repository `eds-220-section-1`.
- Add a brief description for your new repository. For example: EDS 220 section - practice session for data selection in pandas.
- Keep the repository public.
- Initialize the repository with a `README` file and a Python `.gitignore` template.
In the Taylor server, start a new JupyterLab session or access an active one.
Using the terminal, clone the repository to a new directory under your `eds-220` directory.

In the terminal, use `cd` to navigate into the `eds-220-section-1` directory. Use `pwd` to verify `eds-220-section-1` is your current working directory.

Create a new Python notebook in `eds-220-section-1` and update the notebook's name to something useful like `exercise-data-selection.ipynb`.
Use the terminal to stage, commit, and push this file to the remote repository. Remember:

- stage: `git add FILE_NAME`
- commit with message: `git commit -m "COMMIT_MESSAGE"`
- push: `git push`
If you are prompted for your credentials and need to set up a new Personal Access Token (PAT), follow steps 13-18 in this guide to set it up.
CHECK IN WITH YOUR TEAM
MAKE SURE YOU’VE ALL SUCCESSFULLY SET UP YOUR NOTEBOOKS BEFORE CONTINUING
- Add comments in each one of your code cells
- Include markdown cells in between your code cells to add titles/information to each exercise
- Indications about when to commit and push changes are included, but you are welcome to commit and push more often.
- You won’t need to upload any data
For this exercise we will use data about prey items for endangered terrestrial vertebrate species within central California drylands (King et al., 2023).
This dataset is stored in the Knowledge Network for Biocomplexity (KNB) data repository. This is an international repository intended to facilitate ecological and environmental research. It has thousands of open datasets and is hosted by NCEAS!
4.1 Archive exploration
For many datasets, data exploration begins at the data repository. Take some time to look through the dataset’s description in KNB. Discuss the following questions with your team:
- What is this data about?
- Is this data collected in-situ by the authors or is it a synthesis of multiple datasets?
- During what time frame were the observations in the dataset collected?
- Does this dataset come with an associated metadata file?
- Does the dataset contain sensitive data?
In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.
check git status -> stage changes -> check git status -> commit with message -> push changes
4.2 `.xml` metadata exploration
You may have noticed there are two metadata files: `Compiled_occurrence_records_for_prey_items_of.xml` and `metadata_arth_occurrences.csv`.
- In the archive’s dataset description, notice the `.xml` document file type is EML, which stands for Ecological Metadata Language.
- Open the `.xml` file: there’s a lot going on. This is a machine-readable file that has metadata about the whole dataset. You can probably identify some items like title and creators.
- Close the file and delete it - we won’t use it today.
- You don’t need to write anything in your notebook about this section.
4.3 `.csv` metadata exploration
Back in your notebook, import the `pandas` package using its standard abbreviation in a code cell. Then follow these steps to read in the metadata CSV using the `pandas.read_csv()` function:
a. Navigate to the data package site and copy the URL to access the `metadata_arth_occurrences` CSV file. To copy the URL, hover over the Download button -> right click -> "Copy Link".
b. Read in the data from the URL using the `pd.read_csv()` function like this:

```python
# look at metadata
pd.read_csv('the URL goes here')
```
Take a minute to look at the descriptions for the columns.
Note: not all datasets have column descriptions in a `csv` file. Often they come with a `doc` or `txt` file with this information.
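Putting steps (a) and (b) together, a minimal sketch of the two cells could look like this (`metadata` is just an illustrative variable name, and the placeholder string stands in for the URL you copied):

```python
import pandas as pd

# look at the column metadata
metadata = pd.read_csv('the URL goes here')  # placeholder: paste your copied link
metadata
```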
4.4 Data loading
- Follow steps (a) and (b) from the previous exercise to read in the drylands prey data file `arth_occurrences_with_env.csv` using `pd.read_csv()`. Store the dataframe in a variable called `prey` like this:

```python
# read in data
prey = pd.read_csv('the URL goes here')
```
- Use a Python function to check the type of the `prey` variable.
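If you need a nudge, one built-in function that reports an object's type is shown in this sketch:

```python
# check the type of the prey variable;
# a successful read should give pandas.core.frame.DataFrame
type(prey)
```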
check git status -> stage changes -> check git status -> commit with message -> push changes
CHECK IN WITH YOUR TEAM
MAKE SURE YOU’VE ALL SUCCESSFULLY ACCESSED THE DATA BEFORE CONTINUING
4.5 Look at your data
Run `prey` in a cell. What do you notice in the columns section?

To see all the column names in the same display we need to set a `pandas` option. Run the following command and then look at the `prey` data again:

```python
pd.set_option("display.max.columns", None)
```
- Add a comment explaining what `pd.set_option("display.max.columns", None)` does.
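For reference, the commented cell could read something along these lines (the comment wording is just a suggestion):

```python
# display all columns of the dataframe instead of truncating the view
pd.set_option("display.max.columns", None)
```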
check git status -> stage changes -> check git status -> commit with message -> push changes
4.6 `pd.DataFrame` preliminary exploration
Run each of the following methods for `prey` in a different cell and write a brief description of what they do as a comment:
- `head()`
- `tail()`
- `info()`
- `nunique()`
For example:

```python
# head()
# returns the first five rows of the data frame
prey.head()
```
If you’re not sure what a method does, try looking it up in the `pandas.DataFrame` documentation.
- Check the documentation for `head()`. If this method has any optional parameters, change the default value to get a different output.
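For instance, `head()` has an optional parameter controlling how many rows it returns, so one possible variation is:

```python
# return the first 10 rows instead of the default 5
prey.head(10)
```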
Print each of the following attributes of `prey` in a different cell and write a brief explanation of what they are as a comment:

- `shape`
- `columns`
- `dtypes`
If you’re not sure what information an attribute is showing, try looking it up in the `pandas.DataFrame` documentation.
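As a pattern to follow, the first attribute could look like this (the comment wording is yours to improve):

```python
# shape
# a tuple with the number of rows and columns of the dataframe
prey.shape
```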
check git status -> stage changes -> check git status -> commit with message -> push changes
4.7 Update some column names
Change the column names `institutionCode` and `datasetKey` to `institution_code` and `dataset_key`, respectively. Make sure you’re actually updating the dataframe. HINT: yesterday’s class.
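One possible approach is sketched below with `DataFrame.rename()` (yesterday's class may have covered this or another pattern; the reassignment is what makes the update stick):

```python
# rename columns to snake_case and save the result back to prey
prey = prey.rename(columns={"institutionCode": "institution_code",
                            "datasetKey": "dataset_key"})
```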
check git status -> stage changes -> check git status -> commit with message -> push changes
5 References
King, R., Braun, J., Westphal, M., & Lortie, C. J. (2023). Compiled occurrence records for prey items of listed species found in California drylands with associated environmental data. Knowledge Network for Biocomplexity. doi:10.5063/F1VM49RH
Lortie, C. J., Braun, J., King, R., & Westphal, M. (2023). The importance of open data describing prey item species lists for endangered species. Ecological Solutions and Evidence, 4(2), e12251. https://doi.org/10.1002/2688-8319.12251