#### Modeling in the tidyverse, an example predicting tree biomass

##### March 24, 2020

`statistics`

`Regression`

`tidyverse`

`broom`

`modelr`

One of the most revolutionary advances in the last 10 years in the R program has been the tidyverse. I recently watched Hadley Wickham’s talk on the tidyverse from rstudio::conf 2020 and it was a great synopsis of the current status and future of the packages.

The growth of the tidyverse packages is astounding. As an example, the cumulative number of downloads for the `dplyr`

package has reached over 10 million. Incredible!

The tidyverse offers a suite of packages and functions to import, tidy, transform, visualize, and model data. I’ve known well how to use the data wrangling functions from packages like `dplyr`

and data visualization from `ggplot2`

. It’s been that last core function of the tidyverse, modeling data, that I’ve never really realized.

The `modelr`

and `broom`

packages handle modeling in the tidyverse. In this post, I’ll also use the `knitr`

and `kableExtra`

to design some of the tables:

```
library(tidyverse)
library(modelr)
library(knitr)
library(kableExtra)
```

## Tree biomass data

My goal in this post is to fit a series of models that determine the aboveground biomass of trees using tree diameter as a predictor variable. I’ve gathered data from LegacyTreeData, an online repository of individual tree measurements such as volume, weight, and wood density. I queried the database to provide all tree measurements from the US State of Georgia.

After the query, there are 608 observations from seven species that contain a value for the tree’s diameter at breast height, measured in inches (`ST_OB_D_BH`

) and its aboveground dry weight in pounds (`AG_DW`

).

Here’s a “hexagonal heat map” produced in `ggplot`

that divides the x- and y-axes into 20 hexagons. The color of each hexagon reflects the number of observations in each area of the figure. In this data set, most trees are small in diameter and do not weigh a lot:

```
ggplot(tree, aes(ST_OB_D_BH, AG_DW)) +
geom_hex(bins = 20) +
labs(x = "Diameter at breast height (inches)",
y = "Aboveground dry weight (pounds)") +
theme(panel.background = element_rect(fill = "NA"),
axis.line = element_line(color = "black"))
```

The data contain relatively small trees, with DBH ranging values from 0.7 to 8.5 inches. There are at least 40 observations for seven primary species common to Georgia. These are mostly different kinds of pine trees in addition to one hardwood tree (sweetgum):

Species | Num trees | Mean DBH | Max DBH | Min DBH | Mean weight | Max weight | Min weight |
---|---|---|---|---|---|---|---|

Loblolly pine | 186 | 3.1 | 8.5 | 0.7 | 37.1 | 422.8 | 1.7 |

Shortleaf pine | 100 | 3.0 | 4.9 | 1.1 | 26.3 | 78.5 | 1.7 |

Longleaf pine | 80 | 3.0 | 4.9 | 1.2 | 36.5 | 119.1 | 1.9 |

Slash pine | 80 | 3.0 | 4.9 | 1.2 | 30.6 | 105.8 | 1.9 |

Virginia pine | 80 | 3.0 | 4.9 | 1.0 | 37.7 | 128.1 | 2.1 |

Sweetgum | 42 | 1.8 | 3.3 | 1.0 | 7.8 | 26.5 | 1.5 |

Eastern white pine | 40 | 3.0 | 4.9 | 1.0 | 27.5 | 67.7 | 2.1 |

## Models of aboveground tree biomass

To start our modeling analysis, we can fit a model predicting aboveground dry weight using tree diameter. From above, we can see a clear nonlinear trend in the data, indicating that we may want to log-transform the variables. We can do that with the `mutate`

function.

Then, we can start by fitting a linear model using `lm()`

to all observations in the data set. Printing the summary of `tree_mod`

indicates that the model fits the data well, with small standard errors of the coefficients and an R-squared value of 0.9251:

```
tree <- tree %>%
mutate(l_ST_OB_D_BH = log2(ST_OB_D_BH),
l_AG_DW = log2(AG_DW))
tree_mod <- lm(l_AG_DW ~ l_ST_OB_D_BH, data = tree)
summary(tree_mod)
```

```
##
## Call:
## lm(formula = l_AG_DW ~ l_ST_OB_D_BH, data = tree)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21620 -0.27276 -0.03442 0.22539 2.60255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.02268 0.04085 25.04 <2e-16 ***
## l_ST_OB_D_BH 2.26555 0.02619 86.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4428 on 606 degrees of freedom
## Multiple R-squared: 0.9251, Adjusted R-squared: 0.925
## F-statistic: 7485 on 1 and 606 DF, p-value: < 2.2e-16
```

Chapter 24 in the popular book R for Data Science discusses making a grid of data to investigate model predictions. In this step, I’ll use the `data_grid()`

function to generate a grid of data points. Then, I’ll use the `add_predictions()`

function to add the model predictions from `tree_mod`

to complete our data grid. Finally, I’ll back-transform the model predictions to get them into units we understand well.

The model appears to perform well, albeit underpredicting the biomass of a few larger-diameter loblolly pine trees:

```
grid <- tree %>%
data_grid(ST_OB_D_BH = seq_range(ST_OB_D_BH, 20)) %>%
mutate(l_ST_OB_D_BH = log2(ST_OB_D_BH)) %>%
add_predictions(tree_mod, "l_AG_DW") %>%
mutate(AG_DW = 2 ^ l_AG_DW)
ggplot(tree, aes(ST_OB_D_BH, AG_DW)) +
geom_hex(bins = 20) +
geom_line(data = grid, color = "red", size = 1) +
labs(x = "Diameter at breast height (inches)",
y = "Aboveground dry weight (pounds)")
```

A data analyst’s best friend in the tidyverse is the `group_by`

statement. We can fit the same model as we did earlier, but this time I’ll specify it for each of the seven species using `group_by`

. The `tidy()`

function available in the broom package provides a set of functions that put model output into data frames.

Here, we can see that the species have a different set of coefficients and other attributes like p-values:

```
tree_coef<- tree %>%
group_by(Species) %>%
do(broom::tidy(lm(l_AG_DW ~ l_ST_OB_D_BH, .)))
tree_coef
```

```
## # A tibble: 14 x 6
## # Groups: Species [7]
## Species term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Eastern white pine (Intercept) 1.02 0.109 9.40 1.88e-11
## 2 Eastern white pine l_ST_OB_D_BH 2.21 0.0693 31.9 4.86e-29
## 3 Loblolly pine (Intercept) 1.53 0.0775 19.8 2.45e-47
## 4 Loblolly pine l_ST_OB_D_BH 1.96 0.0475 41.3 5.82e-95
## 5 Longleaf pine (Intercept) 0.445 0.103 4.33 4.36e- 5
## 6 Longleaf pine l_ST_OB_D_BH 2.73 0.0652 41.8 3.46e-55
## 7 Shortleaf pine (Intercept) 0.600 0.0683 8.80 4.95e-14
## 8 Shortleaf pine l_ST_OB_D_BH 2.40 0.0432 55.4 8.62e-76
## 9 Slash pine (Intercept) 0.532 0.0962 5.54 4.00e- 7
## 10 Slash pine l_ST_OB_D_BH 2.54 0.0610 41.6 5.33e-55
## 11 Sweetgum (Intercept) 0.749 0.0740 10.1 1.34e-12
## 12 Sweetgum l_ST_OB_D_BH 2.38 0.0834 28.5 3.42e-28
## 13 Virginia pine (Intercept) 1.12 0.0766 14.6 5.04e-24
## 14 Virginia pine l_ST_OB_D_BH 2.38 0.0480 49.6 9.77e-61
```

A way to visualize the species differences is to plot the intercept and slope coefficients with standard errors. Here we can see that all errors bars do not overlap with zero, indicating they’re good models:

```
ggplot(tree_coef, aes(estimate, 1)) +
geom_point() +
geom_errorbarh(aes(xmin = estimate - (2*std.error),
xmax = estimate + (2*std.error),
height = 0.5)) +
scale_y_continuous(limits = c(0,2))+
facet_grid(term~Species)+
labs(x = "Coefficient", y = " ")+
ggtitle("Coefficients for determining aboveground dry weight \n(+/- two standard errors) for seven tree species")+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
```

## Analysis of model predictions

Aside from coefficients, we might be interested in species-specific predictions from a model. The `nest()`

function creates a list of data frames containing all the nested variables in an object. I think of a nested data frame as a “data frame containing many data frames”.

The `by_spp`

object will store this list of data frames for each species so that we can work with them:

```
by_spp <- tree %>%
group_by(Species) %>%
nest()
species_model <- function(df){
lm(l_AG_DW ~ l_ST_OB_D_BH, data = df)
}
models <- map(by_spp$data, species_model)
by_spp <- by_spp %>%
mutate(model = map(data, species_model))
by_spp
```

```
## # A tibble: 7 x 3
## # Groups: Species [7]
## Species data model
## <chr> <list<df[,7]>> <list>
## 1 Sweetgum [42 x 7] <lm>
## 2 Loblolly pine [186 x 7] <lm>
## 3 Shortleaf pine [100 x 7] <lm>
## 4 Virginia pine [80 x 7] <lm>
## 5 Eastern white pine [40 x 7] <lm>
## 6 Slash pine [80 x 7] <lm>
## 7 Longleaf pine [80 x 7] <lm>
```

Similar to what we did above to the all-species equation, we can map the model predictions to the nested object, adding another variable called `preds`

:

```
by_spp <- by_spp %>%
mutate(preds = map2(data, model, add_predictions))
by_spp
```

```
## # A tibble: 7 x 4
## # Groups: Species [7]
## Species data model preds
## <chr> <list<df[,7]>> <list> <list>
## 1 Sweetgum [42 x 7] <lm> <tibble [42 x 8]>
## 2 Loblolly pine [186 x 7] <lm> <tibble [186 x 8]>
## 3 Shortleaf pine [100 x 7] <lm> <tibble [100 x 8]>
## 4 Virginia pine [80 x 7] <lm> <tibble [80 x 8]>
## 5 Eastern white pine [40 x 7] <lm> <tibble [40 x 8]>
## 6 Slash pine [80 x 7] <lm> <tibble [80 x 8]>
## 7 Longleaf pine [80 x 7] <lm> <tibble [80 x 8]>
```

Then, we can unnest the model predictions. Unnesting is the opposite of what we’ve done in the previous step. This time we’re taking the nested data frame and turning it into a “regular” data frame.

The model predictions can now be visualized by species to better understand differences in aboveground biomass:

```
preds <- unnest(by_spp, preds)
preds %>%
mutate(tpred = 2 ^ pred) %>%
ggplot(aes(ST_OB_D_BH, tpred)) +
geom_line(aes(group = Species), size = 1) +
labs(x = "Diameter at breast height (inches)",
y = "Predicted aboveground\ndry weight (pounds)")+
facet_wrap(~ Species)
```

## Analysis of model residuals

Any good data analysis involving modeling will also analyze the residuals. Just like we added model predictions with the `add_predictions`

function and nesting/unnesting, we can add residuals with the `add_residuals`

statement.

We see that the model residuals are generally centered around zero, with those four bigger loblolly pine trees still giving us some issues:

```
by_spp <- by_spp %>%
mutate(resids = map2(data, model, add_residuals))
by_spp
```

```
## # A tibble: 7 x 5
## # Groups: Species [7]
## Species data model preds resids
## <chr> <list<df[,7]>> <list> <list> <list>
## 1 Sweetgum [42 x 7] <lm> <tibble [42 x 8~ <tibble [42 x 8~
## 2 Loblolly pine [186 x 7] <lm> <tibble [186 x ~ <tibble [186 x ~
## 3 Shortleaf pine [100 x 7] <lm> <tibble [100 x ~ <tibble [100 x ~
## 4 Virginia pine [80 x 7] <lm> <tibble [80 x 8~ <tibble [80 x 8~
## 5 Eastern white pi~ [40 x 7] <lm> <tibble [40 x 8~ <tibble [40 x 8~
## 6 Slash pine [80 x 7] <lm> <tibble [80 x 8~ <tibble [80 x 8~
## 7 Longleaf pine [80 x 7] <lm> <tibble [80 x 8~ <tibble [80 x 8~
```

```
resids <- unnest(by_spp, resids)
resids %>%
ggplot(aes(ST_OB_D_BH, resid)) +
geom_point(aes(group = Species), alpha = 1/3) +
geom_smooth(se = F) +
labs(x = "Diameter at breast height (inches)",
y = "Residual")+
geom_abline(intercept = 0, slope = 0, color = "pink")+
facet_wrap(~Species)
```

## Assessing model quality

We also might be interested in other model attributes such as the R-squared, log-likelihood, AIC, and BIC values. These indicators of model fit can be useful to compare with different model forms. The `glance()`

function from the `broom`

package allows you to obtain a data frame with a single row.

```
glance <- by_spp %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = T)
glance %>%
arrange(desc(r.squared))
```

```
## # A tibble: 7 x 16
## # Groups: Species [7]
## Species data model preds resids r.squared adj.r.squared sigma
## <chr> <list<df> <lis> <lis> <list> <dbl> <dbl> <dbl>
## 1 Virgin~ [80 x 7] <lm> <tib~ <tibb~ 0.969 0.969 0.268
## 2 Shortl~ [100 x 7] <lm> <tib~ <tibb~ 0.969 0.969 0.266
## 3 Easter~ [40 x 7] <lm> <tib~ <tibb~ 0.964 0.963 0.288
## 4 Longle~ [80 x 7] <lm> <tib~ <tibb~ 0.957 0.957 0.350
## 5 Slash ~ [80 x 7] <lm> <tib~ <tibb~ 0.957 0.956 0.327
## 6 Sweetg~ [42 x 7] <lm> <tib~ <tibb~ 0.953 0.952 0.245
## 7 Loblol~ [186 x 7] <lm> <tib~ <tibb~ 0.903 0.902 0.509
## # ... with 8 more variables: statistic <dbl>, p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
```

We can visualize the R-squared values from the seven species and see that Virginia pine and shortleaf pine show the highest R-squared, and loblolly pine the lowest:

```
glance %>%
ggplot(aes(reorder(Species, r.squared), r.squared)) +
geom_bar(stat = "identity")+
coord_flip()+
geom_smooth(se = F) +
labs(x = " ",
y = "R-squared")+
theme(legend.position = "none")
```

## Conclusion

Like other data analysts, I’ve spent a lot of time with the tidyverse doing data wrangling and data visualization. But don’t underestimate the power of the modeling functions available in the tidyverse. The suite of functions available in the `modelr`

and `broom`

packages make modeling easier, particularly if the same model forms need to be fit iteratively to different levels (in our case, tree species).

*By Matt Russell. Leave a comment below or email Matt with any questions or comments.*

#### Related

#### Random forests: a tutorial with forestry data

##### September 27, 2021

`analytics`

`Data science`

`random forests`

`statistics`