Quickly explore your timber cruise data with six lines of code

The inspectdf package in R allows you to quickly summarize timber cruise data by species, tree status, and other variables.

Data from timber cruises and other forest inventories need to be clean and easy to interpret. Time is lost in the field and in the office when data are collected sloppily, measured without a protocol, or are otherwise “messy” with lots of errors.

A data-checking routine can save time, money, and resources. Identifying potential inaccuracies or inconsistencies soon after the data are collected helps prevent inappropriate decisions from being made with those data further down the line.

Don’t have a data-checking routine in place? Data-checking routines can ensure quality and reliability in forest inventories and contribute to quality assurance and quality control.

Data-checking routines can also instill confidence in the field crew members who are cruising timber. The job performance of forest analysts depends in part on the quality of the data they’re working with. Forest inventory data that everyone in a company can trust are essential when multi-million dollar decisions are based on them.

An R package for quickly exploring data sets

Any good data-checking routine has a number of attributes:

  • it loads data quickly and efficiently,
  • it provides a useful numerical summary of important tree and stand attributes, and
  • it provides useful visualizations for exploring timber cruise data.

Fortunately, the inspectdf package developed by Alastair Rushworth provides an easy-to-use tool that allows an analyst to inspect, compare, and visualize data frames in R.

The inspectdf package can be installed either from CRAN or from GitHub through the install_github() function in the devtools package. It is also useful to load the tidyverse package, which supplies the pipe operator and the plotting tools used with the graphs that inspectdf produces:

# devtools::install_github("alastairrushworth/inspectdf")
library(inspectdf)
library(tidyverse)

The Cloquet timber cruise data

In 2014 a field crew remeasured 389 fixed-radius permanent sample plots at the Cloquet Forestry Center in Cloquet, Minnesota. Trees greater than 5.0 inches diameter at breast height (DBH) were measured on a 1/7th acre fixed-radius plot, and trees smaller than 5.0 inches DBH were measured on a 1/100th acre plot. The data comprise 14,947 individual tree observations.

The Cloquet data set (tree) includes two numerical variables:

  • tree diameter at breast height (DBH), and
  • total tree height in feet (HT)

Also included are five categorical variables:

  • the primary cover species where the plot is located (CoverType),
  • the status of the tree (e.g., growth tree/standing dead tree; TreeClass),
  • the crown class of the tree (e.g., dominant/overtopped; CrownClass),
  • the crown ratio of the tree, categorized in ten percent increments (CrownRatio), and
  • a two-letter code indicating tree species (SPP)

The data include a variety of species in a variety of conditions:

Table 1: The Cloquet timber cruise data.
CoverType               DBH   TreeClass      CrownClass  CrownRatio  HT  SPP
Red pine                19.8  Standing dead  Dominant    21-30%      80  WP
White pine              12.9  Growth tree    Dominant    11-20%      NA  WP
Nonforest               12.0  Harvested      Codominant  71-80%      NA  WP
Red pine                11.6  Growth tree    Codominant  31-40%      78  WP
Upland mixed hardwoods   5.3  Ingrowth tree  Codominant  51-60%      NA  WP
Red pine                19.3  Standing dead  Codominant  31-40%      95  WP
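
To experiment with the functions below without the full data set, the six rows of Table 1 can be typed into a small data frame. This is only a sketch: the name tree_sample is hypothetical, and six rows are far too few to reproduce the summaries reported for the full data set.

```r
# Hypothetical stand-in built only from the six rows shown in Table 1
tree_sample <- data.frame(
  CoverType  = c("Red pine", "White pine", "Nonforest", "Red pine",
                 "Upland mixed hardwoods", "Red pine"),
  DBH        = c(19.8, 12.9, 12.0, 11.6, 5.3, 19.3),
  TreeClass  = c("Standing dead", "Growth tree", "Harvested", "Growth tree",
                 "Ingrowth tree", "Standing dead"),
  CrownClass = c("Dominant", "Dominant", "Codominant", "Codominant",
                 "Codominant", "Codominant"),
  CrownRatio = c("21-30%", "11-20%", "71-80%", "31-40%", "51-60%", "31-40%"),
  HT         = c(80, NA, NA, 78, NA, 95),
  SPP        = rep("WP", 6),
  stringsAsFactors = TRUE  # store the categorical columns as factors
)

# Show the column types and the first few values of each column
str(tree_sample)
```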

The inspectdf functions

The inspect_types() function allows you to inspect the column type of each variable in the data set. In the Cloquet data set it shows two numeric variables and five categorical variables stored as factors:

inspect_types(tree) %>% show_plot()

The inspect_na() function summarizes each column and presents the percentage of observations that are missing (i.e., ones that have NA values). In the Cloquet data set the only variable with missing data is HT.

In total, 80% of tree observations do not contain a HT measurement. This agrees with the plot measurement protocol of measuring every fifth tree for height as a part of a subsample:

inspect_na(tree) %>% show_plot()
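
As a cross-check on the missing-data percentages, the same quantity can be computed in base R and compared with what the measurement protocol implies. The sketch below uses a small made-up data frame (toy) in which every fifth tree has a height, and the 5-percentage-point tolerance is an arbitrary assumption, not part of the Cloquet protocol.

```r
# Toy data (hypothetical): every fifth tree has a measured height
toy <- data.frame(
  DBH = c(8.1, 6.4, 10.2, 7.7, 12.5, 9.0, 5.6, 11.3, 8.8, 14.1),
  HT  = c(NA, NA, NA, NA, 62, NA, NA, NA, NA, 71)
)

# Percentage of missing values in each column
pct_na <- colMeans(is.na(toy)) * 100
print(pct_na)

# Share expected to be missing when one tree in five is measured for height
expected_pct <- (1 - 1 / 5) * 100

# Flag a mismatch larger than an assumed tolerance of 5 percentage points
height_ok <- abs(pct_na[["HT"]] - expected_pct) <= 5
print(height_ok)
```

A result of FALSE would suggest that heights were skipped, or recorded more often than the subsample calls for.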

The inspect_num() function provides a histogram of the distribution of all numeric columns in the data. The function allows you to quickly spot outliers and data entry errors in the data. In the Cloquet data set, DBH peaks around eight inches and HT around 50 feet:

inspect_num(tree) %>% show_plot()
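
Once a histogram hints at a suspicious value, a simple range check can locate the offending rows. This is a minimal base R sketch on made-up data; the plausible limits for DBH and HT are hypothetical and would need to be set for local species and units.

```r
# Toy data (hypothetical), including one likely data-entry error in DBH
toy <- data.frame(
  DBH = c(8.1, 6.4, 102, 7.7, 12.5),
  HT  = c(48, NA, 55, NA, 62)
)

# Hypothetical plausible limits (inches for DBH, feet for HT)
limits <- list(DBH = c(0.1, 60), HT = c(4.5, 150))

# For each variable, find rows with a non-missing value outside its limits
flagged <- lapply(names(limits), function(v) {
  x <- toy[[v]]
  which(!is.na(x) & (x < limits[[v]][1] | x > limits[[v]][2]))
})
names(flagged) <- names(limits)
print(flagged)
```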

The inspect_cor() function provides the Pearson correlation coefficient for all numeric variables in the data set. In the Cloquet data set, we see that DBH and HT have a correlation coefficient of 0.772, which is what we would expect after inspecting the HT-DBH scatter plot shown previously:

inspect_cor(tree)
## # A tibble: 1 x 6
##   col_1 col_2  corr p_value lower upper
##   <chr> <chr> <dbl>   <dbl> <dbl> <dbl>
## 1 HT    DBH   0.772       0 0.757 0.787
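
For a single pair of variables, base R's cor.test() reports comparable quantities: the Pearson coefficient, a p-value, and a confidence interval. The sketch below runs it on made-up data (toy); it is meant as a cross-check, and it is an assumption that its interval will closely match the one inspect_cor() prints.

```r
# Toy data (hypothetical); heights are missing for some trees
toy <- data.frame(
  DBH = c(5.3, 8.1, 10.2, 11.6, 12.9, 15.4, 19.3, 19.8),
  HT  = c(35, NA, 58, 78, NA, 82, 95, 80)
)

# cor.test() uses only the complete pairs and returns the Pearson
# coefficient, its p-value, and a 95% confidence interval
ct <- cor.test(toy$DBH, toy$HT, method = "pearson")

result <- data.frame(
  corr    = unname(ct$estimate),
  p_value = ct$p.value,
  lower   = ct$conf.int[1],
  upper   = ct$conf.int[2]
)
print(result)
```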

The inspect_cat() function provides a graph showing the frequency of categorical levels in the data using a stacked bar chart. We can see that red pine (RP) and aspen (QA) are the most common cover types and species, most trees are in a codominant crown position, crown ratio is typically between 21 and 40%, and most observations are on growth trees (those that were measured in past inventories) or ingrowth trees:

inspect_cat(tree) %>% show_plot() 
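
When exact proportions are needed rather than a chart, the level frequencies can be tabulated in base R. This is a small sketch on made-up data (toy), not the Cloquet data.

```r
# Toy data (hypothetical) with two categorical columns
toy <- data.frame(
  CrownClass = factor(c("Codominant", "Dominant", "Codominant",
                        "Overtopped", "Codominant")),
  SPP        = factor(c("RP", "QA", "RP", "WP", "RP"))
)

# Proportion of observations in each level, most common level first
freq <- lapply(toy, function(x) sort(prop.table(table(x)), decreasing = TRUE))
print(freq)
```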

For large timber cruise data sets, or for data sets where memory limits pose a problem, the inspect_mem() function may be useful. The function provides a graph showing the total size of the data set and the contribution of each column to that total. The Cloquet data are 532.25 Kb, and the numeric variables DBH and HT occupy the most space in the data (116.24 Kb each):

inspect_mem(tree) %>% show_plot()
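
The per-column sizes can also be approximated in base R with object.size(). Numeric columns are stored as 8-byte doubles, so their size grows by roughly 8 bytes per row. The sketch below uses made-up data (toy).

```r
# Toy data (hypothetical): 1,000 trees with one numeric and one factor column
n <- 1000
toy <- data.frame(
  DBH = seq(5, 25, length.out = n),
  SPP = factor(rep(c("RP", "QA", "WP", "BF"), length.out = n))
)

# Size of each column and of the whole data frame, in bytes
col_bytes   <- sapply(toy, function(x) as.numeric(object.size(x)))
total_bytes <- as.numeric(object.size(toy))

print(col_bytes)
print(total_bytes)
```

Factors store integer codes rather than doubles, which is why the numeric column comes out larger here.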

Summary

The inspectdf package is useful to quickly analyze any timber cruise data set. It provides simple numeric output, produces easy-to-interpret visualizations, and requires a minimal amount of code through its functions.

All of these features make it well suited to rapidly assessing timber cruise data quality. Incorporating data quality routines into your everyday workflow can build confidence in forest inventory data and increase their reliability.
