How similar are cheeses depending on their country of origin? This project uses principal component analysis (PCA) on the nutrition data for 45 types of cheeses from 10 countries to explore this question.
This report presents a principal component analysis (PCA) of the nutrition data for 45 types of cheese and examines whether cheeses with the same country of origin form clusters. The variables used in the analysis are energy (kcal), protein (g), fat (g), carbohydrate (g) and sugar (g) present in \(x\) grams of cheese. It is assumed that the quantity \(x\) is the same for all cheeses. The variables were chosen because they are usually the most visible items on the “Nutrition Facts” label of food products. PCA is appropriate here because the sample size is greater than 10 and all the selected variables are continuous. The analysis also assumes that the variables are linearly related.
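The assumptions above can be checked before running a PCA. The following is a minimal sketch on randomly generated toy data, not the USDA table; the data frame `toy`, its values and the coefficients used to build it are all assumptions made for illustration.

```r
# Toy data: randomly generated values for illustration only,
# NOT taken from the USDA table used in this report.
set.seed(1)
n <- 45
fat_g     <- runif(n, min = 5, max = 35)
protein_g <- runif(n, min = 5, max = 35)
carb_g    <- runif(n, min = 0, max = 5)
toy <- data.frame(
  # energy is built as a linear function of the others (arbitrary coefficients)
  energy_kcal = 9 * fat_g + 4 * protein_g + 4 * carb_g + rnorm(n, sd = 5),
  protein_g   = protein_g,
  fat_g       = fat_g,
  carb_g      = carb_g
)

# 1. PCA needs numeric variables
all_numeric <- all(vapply(toy, is.numeric, logical(1)))

# 2. Pearson correlations summarize pairwise linear association
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# 3. kcal and g are on different scales, so standardize each column
toy_scaled <- scale(toy)
col_sds <- apply(toy_scaled, 2, sd) # standard deviation of each scaled column

# 4. Optional visual check: roughly straight-line point clouds support linearity
# pairs(toy)
```

On the real data, the same checks would be applied to the numeric columns of the cheese table. Standardizing matters here because energy is measured in kcal while the other variables are in grams.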
The data come from the USDA food nutrient information data set (`usda_nutrients.csv`), sourced from FoodData Central, the food composition database of the U.S. Department of Agriculture: https://fdc.nal.usda.gov/index.html
All analyses were performed in R version 4.0.2 using RStudio version 1.3.1093.
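# ---- PACKAGES ----
library(tidyverse) # dplyr, tidyr, stringr, ggplot2 and the %>% pipe
library(here)      # here() for project-relative paths
library(janitor)   # clean_names()
library(ggfortify) # autoplot() method for prcomp objects
# ---- DATA READING ----
food <- read.csv(here("Task2_PCA","usda_nutrients.csv")) %>%
  clean_names() %>% # convert column names to snake_case
  drop_na() # remove rows containing NA values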
# ----- DATA SELECTION -----
cheeses <- food %>%
  filter(str_detect(short_descrip, "CHEESE,"), # select cheeses
         str_detect(food_group, "Dairy")) %>%
  select(!(id:short_descrip) & !(common_name:scientific_name)) %>% # select variables used in analysis
  select(descrip:sugar_g) %>%
  mutate(descrip = str_sub(descrip, start = 9)) %>% # remove "cheese" from description
  separate(descrip, # separate cheese name from characteristics
           into = c("name", "type"),
           sep = ",",
           extra = "merge") %>%
  filter(!str_detect(name, "fat|pasteurized|sodium")) %>% # remove "altered" cheeses
  filter(!str_detect(type, "fat|pasteurized|sodium|with|low|imitation") | is.na(type)) %>%
  mutate(origin = case_when( # label each cheese by country of origin
    name == "mexican" ~ "mex",
    str_detect(type, "queso") ~ "mex",
    str_detect(name, "cheddar|cottage|monterey|muenster|brick|caraway|colby") ~ "us",
    str_detect(name, "fontina|parmesan|mozzarella|provolone|ricotta|romano") ~ "it",
    str_detect(name, "camembert|brie|roquefort|port de salut|blue") ~ "fr",
    str_detect(name, "gruyere|neufchatel|swiss|tilsit") ~ "sw",
    str_detect(name, "gouda|edam|limburger") ~ "nth"
  )) %>%
  relocate(origin) # group categorical variables at beginning
# ---- PRINCIPAL COMPONENT ANALYSIS ----
cheeses_PCA <- cheeses %>%
  select(4:8) %>% # select numerical variables
  scale() %>% # center and standardize each variable
  prcomp()
# ----- BIPLOT -----
cheese_biplot <- autoplot(cheeses_PCA, # automatically recognizes PCA and does biplot
                          data = cheeses,
                          colour = 'origin',
                          #shape=FALSE, # takes out points and shows individual labels
                          #label.size = 4,
                          loadings = TRUE,
                          loadings.label = TRUE) +
  theme_minimal() +
  labs(colour = "Country of origin") + # add labels
  scale_color_hue(labels = c("France", "Italy", "Mexico", "Netherlands", "Switzerland", "USA", "Other"))
cheese_biplot
Figure 2. PCA biplot showing the relationships between energy (kcal), protein (g), fat (g), carbohydrate (g) and sugar (g) present in \(x\) grams of cheese. Each dot represents a different type of cheese. The color of a dot indicates the cheese's country of origin.
The PCA of the cheese data set reveals that 71.32% of the variability is explained by the first two principal components. The biplot in Figure 2 shows that energy, fat and protein content have similarly large loadings on PC1 and PC2, while sugar content has a small one. Figure 2 also shows the following major relations between the variables:
Sugar and protein present in cheese are negatively correlated.
Carbohydrates appear largely uncorrelated with all other variables.
Energy and fat are positively correlated.
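The share of variance explained and the relations between variables can be read directly from a `prcomp` object. The sketch below shows one way to do this on randomly generated toy data; the data frame `toy` and its values are assumptions for illustration and do not reproduce the 71.32% reported above.

```r
# Toy data: randomly generated values for illustration only,
# NOT taken from the USDA table used in this report.
set.seed(42)
n <- 45
fat_g     <- runif(n, min = 5, max = 35)
protein_g <- runif(n, min = 5, max = 35)
sugar_g   <- runif(n, min = 0, max = 5)
toy <- data.frame(
  energy_kcal = 9 * fat_g + 4 * protein_g + rnorm(n, sd = 5), # arbitrary coefficients
  protein_g   = protein_g,
  fat_g       = fat_g,
  sugar_g     = sugar_g
)

# Center and standardize inside prcomp (same effect as scale() followed by prcomp())
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_first_two <- sum(var_explained[1:2]) # share captured by PC1 and PC2 together
print(round(100 * cum_first_two, 2))

# Loadings on PC1 and PC2: these are the arrows drawn in a biplot
loadings <- toy_pca$rotation[, 1:2]
print(round(loadings, 2))

# Correlation matrix of the variables, for comparison with the arrow directions
print(round(cor(toy), 2))
```

As a rule of thumb, arrows pointing in similar directions suggest positive correlation, opposite directions suggest negative correlation, and roughly perpendicular arrows suggest little correlation. This reading is only approximate when the first two components leave a sizeable share of the variance unexplained, so comparing against the correlation matrix is a useful check.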
As for clusters by country of origin, the biplot shows that:
Cheeses from the US form the tightest cluster, with cottage cheese as an outlier on the far right. Mexican cheeses also form a small cluster, with small scores on both PC1 and PC2.
Italian cheeses are quite spread out along PC1, with shredded parmesan on the far left and ricotta on the far right.
Gjetost (a Norwegian cheese), at the top of the plot, appears to be the biggest outlier.
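The cluster observations above are visual. One possible way to quantify how tight each group is would be the mean distance of its points to the group centroid in the PC1-PC2 plane. The sketch below uses randomly generated toy data with hypothetical groups `a`, `b` and `c`; it is not the method used in this report.

```r
# Toy data: three hypothetical groups with different spreads,
# randomly generated for illustration only (NOT the cheese data).
set.seed(7)
toy <- data.frame(
  origin = rep(c("a", "b", "c"), each = 10),
  x1 = c(rnorm(10, 0, 0.2), rnorm(10, 3, 1), rnorm(10, -3, 2)),
  x2 = c(rnorm(10, 0, 0.2), rnorm(10, 3, 1), rnorm(10, -3, 2)),
  x3 = c(rnorm(10, 0, 0.2), rnorm(10, -3, 1), rnorm(10, 3, 2))
)

# PCA on the standardized numeric columns
toy_pca <- prcomp(toy[, -1], center = TRUE, scale. = TRUE)

# Scores (coordinates of each observation) on the first two components
scores <- as.data.frame(toy_pca$x[, 1:2])
scores$origin <- toy$origin

# Mean Euclidean distance of each point to its own group centroid
spread <- sapply(split(scores, scores$origin), function(g) {
  mean(sqrt((g$PC1 - mean(g$PC1))^2 + (g$PC2 - mean(g$PC2))^2))
})

# Smaller values indicate tighter clusters
print(round(sort(spread), 2))
```

Applied to the cheese data, the grouping column would be `origin` and the scores would come from `cheeses_PCA$x`. A single far-away point, such as an outlier within a group, inflates this measure, so a median distance may be a more robust choice.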