Adding some music to boxplots, and better ways to visualize diameter distributions

March 22, 2019
R ggplot2 geom_violin() geom_boxplot()

Boxplots are an old standby in statistics and data analysis. They became popular because they show the distribution of a data set along with a clear visualization of its five-number summary: the minimum, first quartile, median, third quartile and maximum. But how can we unpack what’s going on between those boxes and whiskers?

Some forestry data: In the examples below, I’ve used tree diameters measured in four management units at the Penobscot Experimental Forest in Maine. The four management units are being managed under a selection system, where foresters are harvesting a select number of trees every five or ten years to create an uneven-aged forest.

After plotting the data using boxplots, we can see a greater median in management units 9 and 16. All of the distributions are generally right-skewed, indicating the mean is greater than the median. Outliers in red are identified as large-diameter trees that are more than 1.5 times the size of the interquartile range:

p.tree<-ggplot(tree,aes(MgmtUnit,DBH))+
  xlab("Management unit")+
  ylab("Tree diameter (inches)")+
  theme(legend.position = "none",
        panel.background = element_rect(fill = "NA"),
        axis.line = element_line(color="black"))

p.tree + geom_boxplot(outlier.colour="red",outlier.shape=8)

Violin plots are similar to box plots, except that they also show the kernel probability density of the data for the range of values. Despite the fact that they look like Christmas ornaments, violin plots are more sensitive to changes in the density of a variable of interest.

We can create a violin plot showing the distribution of the tree diameters using geom_violin(). What can we see in the violin plot that we can’t in the boxplot? Some of the distributions display noted “rises” at certain tree diameters, in particular for management units 12 and 20:

p.tree + geom_violin(aes(fill = factor(MgmtUnit)))

We might like to still see some of numerical summaries of the data, say the mean and standard deviations. STHDA highlights some useful code that adds the mean and +/- one standard deviation to the violin plots by calling the data_summary() function:

data_summary <- function(x) {
   m <- mean(x)
   ymin <- m-sd(x)
   ymax <- m+sd(x)
   return(c(y=m,ymin=ymin,ymax=ymax))
}

p.tree + geom_violin(aes(fill = factor(MgmtUnit)))+
  stat_summary(fun.data=data_summary)

If we want the best of both worlds, we can also overlay a boxplot onto a violin plot. Adjusting the width of the boxplot will allow it to fit nicely within the violin plot. Plus, you’ll be able to see how much of the “tails” in the violin plot are represented by the outliers:

p.tree+geom_violin(aes(fill = factor(MgmtUnit)))+
  geom_boxplot(width=0.2)

When you read more about how the Penobscot data were collected, you’ll learn that the forest inventory used different-sized plots depending on the diameter of the trees. Permanent sample plots consisted of nested 1/5-, 1/20-, and 1/50-acre circular plots for measuring trees >= 4.5 inches, 2.5 to 4.4 inches, and 0.5 to 2.4 inches in diameter, respectively.

We might want to know how the distribution of tree diameters differ in the overstory trees (trees measured on the 1/5-acre plot) versus the understory (trees measured on the 1/20- and 1/50-acre plots). Foresters may be interested in knowing how and when the understory trees may be expected to transition into the overstory, and a visual representation of this would help.

jan-glx provides code to produce a split violin plot, which produces a side-by-side comparison of two attributes. Using the geom_split_violin() function, we’re able to see that there are a lot of 1 to 2-inch trees across all management units:

ggplot(tree, aes(MgmtUnit, DBH, fill = factor(treesize))) + 
  geom_split_violin()+
  xlab("Management unit")+
  ylab("Tree diameter (inches)")+
  theme(axis.line = element_line(color="black"),
        panel.background = element_rect(fill = "NA"))

A goal of selection harvesting is to create a reverse-J diameter distribution, with lots of small trees and fewer large trees. We can see this with the violin plots, but aren’t able to see it with a traditional boxplot.

P.S. Thanks to the Penobscot Experimental Forest for their excellent research data catalog that contains many long-term data sets of their great studies.

By Matt Russell. Leave a comment below or email Matt with any questions or comments.

A list of R packages for forestry applications

November 24, 2023
analytics R R packages statistics data science forestry

Recent updates to tidyverse functions

September 14, 2023
analytics data science R tidyverse

New book: Statistics in Natural Resources: Applications with R

July 20, 2022
analytics books education R statistics teaching statistics stats4nr