#### Random forests: a tutorial with forestry data

##### September 27, 2021
analytics Data science random forests statistics Random forests have quickly become one of the most popular analytical techniques used in forestry today. Random forests (RF) are a machine learning technique that differ in many ways to traditional prediction models such as regression. Random forests can handle a lot of data, can be applied to classification or regression problems, and rank the relative importance of many variables that are related to a response variable of interest.

I’ve written about the theory behind random forests. This post will present a tutorial of using random forests in R.

## The Parresol tree biomass data

As an example, we’ll use a data set of 40 slash pine trees from Louisiana USA presented in Parresol’s 2001 paper Additivity of nonlinear biomass equations. The data are presented in Table 1 of the paper, which is replicated in this Google Sheet.

We’ll read in the data using the read_sheet() function from the googlesheets4 package. We will also load the tidyverse package to use some of its plotting features:

library(tidyverse)

tree <- read_sheet("https://docs.google.com/spreadsheets/d/1TPutUVyZLWr7XopKguT5Nvh9lo1EOG4wvOZ6_lD1F_M/edit?usp=sharing")

The data contain the following variables:

• TreeID: Tree observation record,
• DBH: Tree diameter at breast height, cm,
• HT: Tree height, m,
• LCL: Tree live crown length, m,
• Age: Age of the tree, years,
• Mass_wood: Green mass of the wood in the tree, kg,
• Mass_bark: Green mass of the bark in the tree, kg,
• Mass_crown: Green mass of the crown of the tree, kg, and
• Mass_tree: Green mass of all tree components, kg.

Our ultimate interest is in predicting the mass all tree components using common tree measurements such as tree diameter, height, live crown length, and age. Before we start modeling with the data, it is a good practice to first visualize the variables. The ggpairs() function from the GGally package is a useful tool that visualizes the distribution and correlation between variables:

library(GGally)

ggpairs(tree, columns = c(2:5, 9)) You can see a few variables have strong positive correlations with the mass of the tree (e.g., height and diameter) and some more moderate positive correlations (e.g., age).

## The randomForest R package

R and Python both have numerous packages that implement random forests. In R alone, there are nearly 400 packages with the word “tree” or “forest” in their name. (Sidebar: This is not ideal if you’re a forest analyst of biometrician because only 31 of them are actually about forestry.)

Breiman wrote about random forests in 2001 and a year later Liaw and Wiener created an R package that implements the technique. To date, the randomForest R package remains one of the most popular ones in machine learning.

We can install and load the randomForest package:

# install.packages("randomForest")
library(randomForest)

We will use the randomForest() function to predict total tree mass using several variables in the tree data set. A few other key statements to use in the randomForest() function are:

• keep.forest = T: This will save the random forest output, which will be helpful in summarizing the results.
• importance = TRUE: This will assess the importance of each of the predictors, essential output in random forests!
• mtry = 1: This tells the function to randomly sample one variable at each split in the random forest. For applications in regression, the default value is the number of predictor variables divided by three (and rounded down). In the modeling, several small samples of the entire data set are taken. Any observations that are not taken are called “out-of-bag” samples.
• ntree = 500. This tells the function to grow 500 trees. Generally, a larger number of trees will produce more stable estimates. However, increasing the number of trees needs to be done with consideration of time and memory issues when dealing with large data sets.

Our response variable in the random forests model is Mass_tree and predictors are DBH, HT, LCL, and Age.

tree.rf <- randomForest(Mass_tree ~ DBH + HT + LCL + Age,
data = tree,
keep.forest = T,
importance = TRUE,
mtry = 1,
ntree = 500)
tree.rf
##
## Call:
##  randomForest(formula = Mass_tree ~ DBH + HT + LCL + Age, data = tree,      keep.forest = T, importance = TRUE, mtry = 1, ntree = 500)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
##
##           Mean of squared residuals: 8432.869
##                     % Var explained: 88.6

Note the mean of squared residuals and the percent variation explained (analogous to R-squared) provided in the output. (We’ll revisit them later.)

Another way to visualize the out-of-bag error rates of the random forests models is to use the plot() function. In this application, although we specified 500 trees, the out-of-bag error generally stabilizes after 100 trees:

plot(tree.rf) Some of the most helpful output in random forests is the importance of each of the predictor variables. The importance score is calculated by evaluating the regression tree with and without that variable. When evaluating the regression tree, the mean square error (MSE) will go up, down, or stay the same.

If the percent increase in MSE after removing the variable is large, it indicates an important variable. If the percent increase in MSE after removing the variable is small, it’s less important.

The importance() function prints the importance scores for each variable and the varImpPlot() function plots them:

importance(tree.rf)
##       %IncMSE IncNodePurity
## DBH 16.536030      821970.5
## HT  16.247151      861302.9
## LCL 14.062093      633208.3
## Age  8.155748      323918.3
varImpPlot(tree.rf,type=1) The output indicates that DBH is the most important variable for predicting Mass_tree and age the least important.

## Comparing random forests and regression models

Forest analysts are often compare multiple models and determine which one has a better predictive ability. In this case, we can fit a multiple linear regression model to the data and compare to the random forests model.

The lm() function can be used to develop a parametric model for Mass_tree:

tree.reg <- lm(Mass_tree ~ DBH + HT + LCL + Age, data = tree)
summary(tree.reg)
##
## Call:
## lm(formula = Mass_tree ~ DBH + HT + LCL + Age, data = tree)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -136.285  -57.177   -9.399   43.822  189.758
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -545.374     67.916  -8.030 1.89e-09 ***
## DBH           40.523      5.778   7.013 3.68e-08 ***
## HT           -15.048      8.079  -1.862   0.0709 .
## LCL            2.490     12.259   0.203   0.8402
## Age           15.431      3.198   4.825 2.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.33 on 35 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9106
## F-statistic: 100.4 on 4 and 35 DF,  p-value: < 2.2e-16

Note the residual standard error of 82.33 kg and the adjusted R-squared of 0.91. The residual standard error is slightly lower and the R-squared value slightly higher for the multiple regression model compared to the random forest output. In addition, further work may be conducted on the multiple regression model by removing the non-significant variables and refitting the model.

Another aspect of model evaluation is comparing predictions. Although random forests models are often considered a “black box” method because their results are not easily interpreted, the predict() function provides predictions of total tree mass:

Mass_pred_rf <- predict(tree.rf, tree, predict.all = F)
Mass_pred_reg <- predict(tree.reg, tree, predict.all = F)

In an ideal setting we might test our model on an independent data set not used in model fitting. However, we can combine the predicted tree weights from both models to the tree data set:

tree2 <- as.data.frame(cbind(tree, Mass_pred_rf, Mass_pred_reg))

Note that some predictions from the linear regression model on the 40 trees provide negative values for predicted total tree mass, an undesirable feature that may need to be addressed before implementing the model:

tree2 %>%
summarize(Mass_tree, Mass_pred_rf, Mass_pred_reg)
##    Mass_tree Mass_pred_rf Mass_pred_reg
## 1        9.8     25.78337   -108.051811
## 2       12.1     25.13321    -86.903051
## 3       24.4     46.33957    -62.814807
## 4       27.0     35.07399    -42.513067
## 5       33.6     42.35590     -7.391764
## 6       43.5     44.74030     -8.814627
## 7       46.0    105.06471    168.354603
## 8       56.1     79.76736     -8.073626
## 9       64.4     66.07957     47.563293
## 10      70.8    103.35683     64.024945
## 11      75.9    114.75971    189.278688
## 12      88.7     84.72086    103.439719
## 13      95.7    102.82287     18.126885
## 14     102.4    146.06540    238.684823
## 15     123.7    138.20651     90.880242
## 16     147.6    177.44083    261.258307
## 17     148.5    145.70510    174.276553
## 18     174.8    165.93732    186.750002
## 19     193.0    164.60443    199.939638
## 20     211.7    204.88033    306.293014
## 21     214.6    272.80811    186.964881
## 22     225.3    245.85032    240.537957
## 23     244.7    245.55750    277.932654
## 24     258.2    292.13384    263.034375
## 25     285.8    275.14268    363.444771
## 26     297.6    284.67062    317.816460
## 27     309.8    269.27449    366.051168
## 28     316.2    341.63584    392.605018
## 29     318.0    320.49226    283.934957
## 30     401.1    417.38922    399.000959
## 31     402.2    394.79413    463.875242
## 32     411.9    406.38153    450.015697
## 33     446.3    467.69703    458.158909
## 34     490.3    433.95661    546.871939
## 35     522.6    515.31412    583.516204
## 36     522.7    466.91310    519.012527
## 37     593.6    560.88703    652.592720
## 38     900.3    811.64060    714.152257
## 39    1034.9    910.49999    845.142484
## 40    1198.5    990.69244   1095.330861

We may also be interested in plotting residual values from both model types to compare their performance:

p.rf <- ggplot(tree2, (aes(x = Mass_pred_rf, y = Mass_tree - Mass_pred_rf))) +
geom_point() +
scale_y_continuous(limits = c(-200, 200)) +
labs(x = "Predicted tree mass (kg)",
y = "Residual (kg)",
subtitle = "Random forests model")

p.reg <- ggplot(tree2, (aes(x = Mass_pred_reg, y = Mass_tree - Mass_pred_reg))) +
geom_point() +
scale_y_continuous(limits = c(-200, 200)) +
labs(x = "Predicted tree mass (kg)",
y = "Residual (kg)",
subtitle = "Regression model")

library(patchwork)

p.rf + p.reg
## Warning: Removed 1 rows containing missing values (geom_point). With the heteroscedastic residuals in the models, we’d likely want to explore transforming the data prior to model fitting, or to explore other modeling techniques.

## Summary

Random forests techniques are flexible and can perform comparably with other regression or classification methods. Random forests can handle all types of data (e.g., categorical, continuous) and are advantageous because they work well with data sets containing a large number of predictor variables. The randomForest package has seen a lot of development and can be used to help solve modeling problems in your future forest analytics work.

By Matt Russell. Email Matt with any questions or comments. Sign up for my monthly newsletter for in-depth analysis on data and analytics in the forest products industry.

#### The importance of citizen data scientists in forest analytics

##### September 7, 2021
analytics citizen data science careers Data science forestry statistics

#### Forest carbon: key definitions and numbers in perspective

##### August 8, 2021
analytics carbon carbon markets forest carbon

#### Drought in the US West: impacts from the last ten years

##### July 23, 2021
analytics climate forest inventory drought