Logistic regression: classifying palm species

Classification of palm species by their morphological characteristics via logistic regression. Simple machnine learning example.

Carmen Galaz-García true
01-29-2021

Introduction

In this report we use binary logistic regression to test the feasibility of using morphological characteristics to classify whether a palmetto plant is species serenoa repens or species sabal etonia. The variables we use for this classification are the height (cm), widest length of the canopy (cm), widest width of the canopy perpendicular to the canopy length (cm) and number of green leaves (n) in a palmetto plant. Data was collected by the Archbold Biological Station in south-central Florida from 1981 through 1997 then again in 2001 and 2017.

All analyses are in R version 4.0.2 using RStudio Version 1.3.1093.

Data exploration

Show code
# --- DATA SELECTION ---- 

raw_palmetto <- read_csv("palmetto.csv")

palmetto <- raw_palmetto %>% 
  select(species, height:green_lvs) %>% 
  mutate(species = case_when( species==1 ~ "serenoa repens",   # update species names
                              species==2 ~ "sabal etonia"))

palmetto$species<- as.factor(palmetto$species) # convert to factor
#levels(palmetto$species)    # check factor levels
Show code
# --- GRAPH 1: CANOPY DIMENSIONS ---

ggplot(data=palmetto, aes(x=length, y=width))+
  geom_point( cex=0.3, alpha=0.2)+    # point size and transparency
  facet_wrap(~species)+     # divide by species
  labs( x= "length (cm)",   # add labels and title
        y= "width (cm)",
       title = expression(paste(
       "Canopy dimensions in ",
       italic("sabal etonia"),
       " and ",
       italic("serenoa repens"),
       " individuals"))
        )+
  #geom_smooth(method = "lm", se = FALSE, color = "forestgreen")+
  theme_light()

Figure 1. Relationship between widest length of the canopy (cm) and widest width of the canopy perpendicular to the canopy length (cm) of the palmetto plant species sabal etonia and serenoa repens. Each point indicates measurements of an individual plant. Data: Abrahamson (2019)

From figure 1 we can see that canopy width and length for both plant species are positively correlated. Plants from sabal etonia species are slightly longer than those from serenoa repens.

Show code
# --- GRAPH 2: NUMBER OF LEAVES HISTOGRAM  ---

ggplot(data=palmetto, aes(x=green_lvs))+
  geom_histogram(binwidth = 1)+  # update number of bins
  facet_wrap(~species)+                     # divide by species
  labs( x = "number of green leaves (n)",   # add labels and title
        y = "count",
        title = expression(paste(
       "Number of green leaves in ",
       italic("sabal etonia"),
       " and ",
       italic("serenoa repens"),
       " individuals"))
       )+
  theme_light()

Figure 2. Histograms of number of green leaves (n) in the palmetto plant species sabal etonia and serenoa repens. Data: Abrahamson (2019)

In figure 2 we can clearly see that most sabal etonia plants have 5 or less green levaes. These are fewer green leaves than what serenoa repens plants have, serenoa repens plants have a median of about 7 green leaves per plant.

Show code
# --- GRAPH 3: HEIGHT

ggplot(data=palmetto, aes(x=species, y=height))+
  geom_jitter(color="forestgreen", size=0.4, alpha=0.2) +
  geom_boxplot(alpha=0)+
  labs(y = "height (cm)",
        title = expression(paste(
       "Height of ",
       italic("sabal etonia"),
       " and ",
       italic("serenoa repens"),
       " individuals"))
       )+
  theme_minimal()

Show code
# ---- DESCRIPTIVE STATISTICS ----
summary_palmetto <- palmetto %>% 
  group_by(species) %>% 
  summarize(mean = mean(height, na.rm=TRUE),
            sd = sd(height, na.rm=TRUE),
            #median = median(weight, na.rm=TRUE),
            n = n()
            )

# --- TWO SAMPLE T-TEST AND COHEN'S D ----

# --- Convert data to vector form
sabal_height <- palmetto %>% 
  filter(species == "sabal etonia") %>% 
  pull(height)

serenoa_height <- palmetto %>% 
  filter(species == "serenoa repens") %>% 
  pull(height)

# --- Two-sample t-test ---
ttest_height <- t.test(sabal_height,serenoa_height) %>% 
  tidy # make test parameters available for in-line referencing

# --- Cohen's d ---
height_effsize <- cohen.d( serenoa_height,sabal_height, na.rm=TRUE)

Figure 3. Height (cm) of the palmetto plant species sabal etonia and serenoa repens. Each green point indicates the height of an individual plant. Box endpoints indicate the 25th and 75th percentile values, the black line within the box indicate the median for each species. Black points at the end of the boxes indicate possible outliers. Data: Abrahamson (2019)

On average sabal etonia plants are shorter than serenoa repens plants: 90.09 \(\pm\) 24.17 and 92.19 \(\pm\) 28.11 cm, respectively (mean \(\pm\) 1 standard deviation). The absolute difference in means is 2.1 cm, so on average the serenoa repens plants are 2.33% taller than sabal etonia plants. This difference in means is significant (Welch’s two-sample t-test: t(1.19707610^{4}) = -4.44, p< 0.001) however the effect size is small (Cohen’s d = 0.08).

Binary logistic regression

Based on the preliminary data visualization and analysis, it seems possible that the differences in canopy dimensions, height and number of green leaves for serenoa repens and sabal etonia plant species are significant. These characteristics could be used classify whether a palmetto plant belongs to either one of these species. To assess this we used a binary logistic regression to create a classification, the coefficients of the model are shown in the following table.

Table 1. Coefficients of the binary logistic regression model for classifying serenoa repens and sabal etonia palmetto plants based on their height (cm), canopy length and width (cm) and number of green leaves (n). The standard error and p-values for each coefficient are included. Data: Abrahamson (2019).

Show code
# ---- BINARY LOGISTIC REGRESSION ---- 

# -- create model
blr_palmetto <- glm(species ~ height + length + width + green_lvs,
                            data = palmetto,
                            family = "binomial") #specify model is BLR

# -- table with model information 
blr_model <- tidy(blr_palmetto) %>% 
  mutate(pvalue = "< 0.001") %>%     # after looking at p-values, they are all <0.001
  mutate(term = c("y-intercept",     # update variable names
                  "height (cm)", 
                  "length (cm)", 
                  "width (cm)", 
                  "number of green leaves (n)")
         ) %>% 
  select(!statistic & !p.value)%>%   # exclude statistic and p-value columns
  kable( col.names = c("Variable",   # turn into kable and update column names
                        "Coefficient", 
                        "Standard error",
                        "p-value") 
                         ) %>% 
   kable_styling(full_width = FALSE) 

blr_model
Variable Coefficient Standard error p-value
y-intercept -3.2266851 0.1420708 < 0.001
height (cm) 0.0292173 0.0023061 < 0.001
length (cm) -0.0458233 0.0018661 < 0.001
width (cm) -0.0394434 0.0021000 < 0.001
number of green leaves (n) 1.9084747 0.0388634 < 0.001

Model accuracy

Based on the previous binary logistic regression we constructed a probability function \(P\) dependent on a plant’s height \(h\) (cm), canopy length \(l\) (cm), canopy width \(w\) (cm) and number of green leaves \(n\). Then \(P(h,l,w,n)\) is the probability that a palmetto plant with the specified measurements being a serenoa repens. If \(X\) is an individual of species serenoa repens with measurements \(h, l,w\) and \(n\), we consider \(X\) to be correctly classified by the model if \(P(h,l,w,n)>0.5\). Similarly, if \(X\) is an individual of species sabal etonia with measurements \(h, l,w\) and \(n\), we consider \(X\) to be correctly classified by the model if \(P(h,l,w,n)\leq 0.5\).

Table 2. Number and percentages of each species correctly and incorrectly classified by the model.

Show code
# ---- MODEL ACCURACY ---- 

results_blr <- blr_palmetto %>% 
  broom::augment(type.predict="response") %>% # calculate probability for each outcome
  select(species,.fitted) %>%                 # select species and probability columns only
  # check if classifcation is correct
  mutate(classification = case_when( .fitted>0.5 & species=="serenoa repens"~ "True",   
                                     .fitted<=0.5 & species=="serenoa repens"~ "False",
                                     .fitted>0.5 & species=="sabal etonia"~ "False",
                                     .fitted<=0.5 & species=="sabal etonia"~ "True")) %>% 
  count(species, classification) %>%    
  mutate(percentage = case_when(     # add column with percentages 
    species == "sabal etonia" ~ scales::percent(n/(454+5701),accuracy = 0.01),
    species == "serenoa repens" ~ scales::percent(n/(564+5548),accuracy = 0.01))
    ) %>% 
  kable( col.names = c("Species",    # create kable table, update column names
                        "Correctly classified by model", 
                        "Count",
                       "Percentage") 
                         ) %>% 
   kable_styling(full_width = FALSE)

results_blr
Species Correctly classified by model Count Percentage
sabal etonia False 454 7.38%
sabal etonia True 5701 92.62%
serenoa repens False 564 9.23%
serenoa repens True 5548 90.77%

Citation

Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/f2f96ec76fbbd4b9db431c79a770c4d5