Spreading and gathering tree data

March 9, 2019
R tidyr gather() spread()

A student emailed this week with a data management need: how to reshape thousands of rows of data so that a calculation could be made across two time periods. In short, the data were in a long format and needed to be wide.

The experiment was something like this: the student had data containing the number of tree seedlings in a forest, measured in eight different plots. She revisited the same forest five years later and measured the number of seedlings again.

The veg dataset is in a long format: each row records a plot, the measurement period (“Pre” for the initial visit, “Post” for five years later), and the number of seedlings per acre:

##    PlotID Period Seedlings
## 1       1    Pre      1200
## 2       1   Post       800
## 3       2    Pre      1250
## 4       2   Post       950
## 5       3    Pre      1350
## 6       3   Post      1200
## 7       4    Pre      1200
## 8       4   Post       650
## 9       5    Pre      1100
## 10      5   Post       950
## 11      6    Pre      1350
## 12      6   Post       900
## 13      7    Pre      1200
## 14      7   Post       650
## 15      8    Pre      1240
## 16      8   Post       910

So what happened during the five years? The number of tree seedlings generally decreased. Trees grew larger into the sapling class and many seedlings suffered mortality.

Now, how can we calculate and visualize the change in tree seedlings between the two measurements? We can convert the veg data to a wide format using the spread() function from the tidyr package. This turns the 16 rows of data into eight rows, one per plot. In the new dataset, called veg_wide, seedlings are stored in two columns: one for the pre measurement and one for the post measurement.
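The call that builds veg_wide is not shown here, so the following is only a minimal sketch of how it might look. It rebuilds veg from the printed values above, and it assumes that Period is a factor ordered Pre then Post and that the sep argument is set to "_seedlings_", which is one way to arrive at the column names used in the rest of this post.

```r
library(tidyr)

# Rebuild the veg data printed above.
# Assumption: Period is a factor with "Pre" first, so the Pre column comes first.
veg <- data.frame(
  PlotID = rep(1:8, each = 2),
  Period = factor(rep(c("Pre", "Post"), times = 8), levels = c("Pre", "Post")),
  Seedlings = c(1200, 800, 1250, 950, 1350, 1200, 1200, 650,
                1100, 950, 1350, 900, 1200, 650, 1240, 910)
)

# Long to wide: one row per plot, one column per period.
# Assumption: sep = "_seedlings_" makes spread() name the new columns
# "Period_seedlings_Pre" and "Period_seedlings_Post".
veg_wide <- spread(veg, key = Period, value = Seedlings, sep = "_seedlings_")
veg_wide
```

Without sep, spread() would simply name the new columns Pre and Post.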

Now we can easily calculate a new variable that quantifies the change in the number of seedlings, called delta_Seedlings:

veg_wide$delta_Seedlings <- veg_wide$Period_seedlings_Post - veg_wide$Period_seedlings_Pre
veg_wide
##   PlotID Period_seedlings_Pre Period_seedlings_Post delta_Seedlings
## 1      1                 1200                   800            -400
## 2      2                 1250                   950            -300
## 3      3                 1350                  1200            -150
## 4      4                 1200                   650            -550
## 5      5                 1100                   950            -150
## 6      6                 1350                   900            -450
## 7      7                 1200                   650            -550
## 8      8                 1240                   910            -330

With delta_Seedlings calculated, we can visualize the primary variable we’re interested in: the change in seedlings on each plot.
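The figure itself is not reproduced here. As a rough sketch of one way to draw it, the code below rebuilds veg_wide from the values printed above and plots delta_Seedlings for each plot with ggplot2. The choice of a bar chart, the axis labels, and the object name p are assumptions for illustration, not necessarily what the original figure showed.

```r
library(ggplot2)

# Wide data as printed above, rebuilt so this sketch runs on its own
veg_wide <- data.frame(
  PlotID = 1:8,
  Period_seedlings_Pre  = c(1200, 1250, 1350, 1200, 1100, 1350, 1200, 1240),
  Period_seedlings_Post = c(800, 950, 1200, 650, 950, 900, 650, 910)
)
veg_wide$delta_Seedlings <- veg_wide$Period_seedlings_Post - veg_wide$Period_seedlings_Pre

# One bar per plot; bars below zero indicate a loss of seedlings.
# Assumption: a bar chart is used here only as an example chart type.
p <- ggplot(veg_wide, aes(x = factor(PlotID), y = delta_Seedlings)) +
  geom_col() +
  labs(x = "Plot", y = "Change in seedlings per acre (post minus pre)")
p
```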

The data are now in a wide format. We can use the tidyr package to convert them back to a long format, where each row holds one plot’s measurement for one period in two variables: Period and Seedlings. We’ll use the gather() function to do this, storing the result in a new data frame called veg_long.

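library(dplyr)  # select() and the %>% pipe
library(tidyr)  # gather()

# Stack the two seedling columns into Period (key) and Seedlings (value)
veg_long <- veg_wide %>%
  gather(key = Period, value = Seedlings,
         Period_seedlings_Pre, Period_seedlings_Post)
veg_long %>% select(PlotID, Period, Seedlings)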
##    PlotID                Period Seedlings
## 1       1  Period_seedlings_Pre      1200
## 2       2  Period_seedlings_Pre      1250
## 3       3  Period_seedlings_Pre      1350
## 4       4  Period_seedlings_Pre      1200
## 5       5  Period_seedlings_Pre      1100
## 6       6  Period_seedlings_Pre      1350
## 7       7  Period_seedlings_Pre      1200
## 8       8  Period_seedlings_Pre      1240
## 9       1 Period_seedlings_Post       800
## 10      2 Period_seedlings_Post       950
## 11      3 Period_seedlings_Post      1200
## 12      4 Period_seedlings_Post       650
## 13      5 Period_seedlings_Post       950
## 14      6 Period_seedlings_Post       900
## 15      7 Period_seedlings_Post       650
## 16      8 Period_seedlings_Post       910

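Look familiar? The veg_long data frame holds the same seedling counts as the original veg dataset. The differences are that Period now carries the full column names (for example, Period_seedlings_Pre rather than Pre), the rows are ordered by period rather than by plot, and veg_long still includes the delta_Seedlings column, which is why select() is used to display it. The spread() and gather() functions are two of many functions in R for organizing tidy data.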
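Since this post was written, tidyr 1.0.0 introduced pivot_wider() and pivot_longer(), and the tidyr documentation now marks spread() and gather() as superseded. The sketch below shows what the same round trip might look like with the newer functions. The names veg_wide2 and veg_long2 are hypothetical, and the names_prefix value is chosen only to match the column names used in this post.

```r
library(tidyr)

# Rebuild the original long data (Period kept as character here)
veg <- data.frame(
  PlotID = rep(1:8, each = 2),
  Period = rep(c("Pre", "Post"), times = 8),
  Seedlings = c(1200, 800, 1250, 950, 1350, 1200, 1200, 650,
                1100, 950, 1350, 900, 1200, 650, 1240, 910),
  stringsAsFactors = FALSE
)

# Long to wide, adding the same prefix used in the wide column names above
veg_wide2 <- pivot_wider(veg, names_from = Period, values_from = Seedlings,
                         names_prefix = "Period_seedlings_")

# Wide to long, stripping the prefix so Period holds just "Pre" or "Post"
veg_long2 <- pivot_longer(veg_wide2, cols = starts_with("Period_seedlings_"),
                          names_to = "Period", names_prefix = "Period_seedlings_",
                          values_to = "Seedlings")
veg_long2
```

Because names_prefix removes the prefix on the way back, Period ends up with the same Pre and Post labels as the original veg data.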

By Matt Russell. Leave a comment below or email Matt with any questions or comments.
