Principal Component Analysis (PCA): exploring similarities among cheeses

How similar are cheeses depending on their country of origin? This project uses principal component analysis (PCA) on the nutrition data for 45 types of cheeses from 10 countries to explore this question.

Carmen Galaz-García true
01-25-2021

Introduction

This report includes a principal component analysis (PCA) on the nutrition data for 45 types of cheeses and examine whether cheeses with the same country of origin form data clusters. The variables used in the analysis are energy (kcal), protein (g), fat (g), carbohydrate (g) and sugar (g) present in \(x\) grams of cheese. It is assumed that the quantity \(x\) is the same for all cheeses. The variables were chosen according to what is usually most visible in the “Nutrition Facts” label in food products. A PCA analysis is appropriate given the sample size (greater than 10) and since all the indicated variables are continuous. The assumption that variables are linearly related is also needed.

The data comes from the Food nutrient information for raw fruits and veggies from USDA database of the FoodData Central in the U.S. Department of Agriculture. https://fdc.nal.usda.gov/index.html

All analyses are in R version 4.0.2 using RStudio Version 1.3.1093.

'Cheese' by rjhuttondfw is licensed under CC BY 2.0

Figure 1: ‘Cheese’ by rjhuttondfw is licensed under CC BY 2.0

Data wrangling

Show code
# ---- DATA READING ----
food <- read.csv(here("Task2_PCA","usda_nutrients.csv")) %>% 
  clean_names() %>%      
  drop_na()              # remove NA values

# ----- DATA SELECTION -----
cheeses <- food %>% 
  filter(str_detect(short_descrip, "CHEESE,"),   # select cheeses
         str_detect(food_group, "Dairy") ) %>% 
  select(!(id:short_descrip) & !(common_name:scientific_name)) %>%   # select variables used in analysis
  select(descrip:sugar_g) %>% 
  mutate(descrip = str_sub(descrip, start=9)) %>% # remove "cheese" from description
  separate(descrip,                               # separate cheese name from characteristics
           into=c("name", "type"), 
           sep=",", 
           extra="merge") %>%
  filter(!str_detect(name,"fat|pasteurized|sodium")) %>%     # remove "altered" cheeses
  filter(!str_detect(type,"fat|pasteurized|sodium|with|low|imitation") | is.na(type)) %>% 
  mutate(origin = case_when(                      # label each cheese by country of origin
    name == "mexican" ~ "mex",
    str_detect(type,"queso")~ "mex",
    str_detect(name,"cheddar|cottage|monterey|muenster|brick|caraway|colby") ~ "us",
    str_detect(name,"fontina|parmesan|mozzarella|provolone|ricotta|romano") ~ "it",
    str_detect(name,"camembert|brie|roquefort|port de salut|blue")~ "fr",
    str_detect(name,"gruyere|neufchatel|swiss|tilsit") ~ "sw",
    str_detect(name,"gouda|edam|limburger") ~ "nth"
  )) %>% 
  relocate(origin)    # group categorical variables at beginning

Principal Component Analysis and Biplot

Show code
# ---- PRINCIPAL COMPONENT ANALYSIS ----
cheeses_PCA <- cheeses %>% 
  select(4:8) %>%    # select numerical variables
  scale() %>% 
  prcomp

# ----- BIPLOT -----
cheese_biplot <- autoplot(cheeses_PCA,   # automatically recognizes PCA and does biplot
         data= cheeses,
         colour = 'origin',
         #shape=FALSE,      # takes out points and shows individual labels
         #label.size = 4,
         loadings = TRUE,
         loadings.label=TRUE)  +
  theme_minimal()  +
  labs(colour="Country of origin") +  # add labels
  scale_color_hue(labels= c("France", "Italy", "Mexico","Netherlands","Switzerland","USA", "Other"))

cheese_biplot

Figure 2. PCA biplot showing the relationships between energy (kcal), protein (g), fat (g), carbohydrate (g) and sugar (g) present in \(x\) grams of cheese. Each dot represents a different type of cheese. The color of a dote indicates the country of origin of the cheese.

Exploratory Findings

The PCA for our cheeses data set reveals that 71.32% of the variability is explained by the first two principal components. The biplot in figure 2 shows there is a similar and large variance with respect to PC1 and PC2 across energy, fat and protein content, while sugar content has a small variance. Figure 1 also shows the following major relations between the variables:

As for data clusters with respect to the cheese’s country of origin the biplot shows that: