Forget spreading and gathering your R data, try pivoting instead

August 8, 2020
analytics R tidyr pivot_long pivot_wide

Reshaping data from a wide to a long format, or vice versa, is one of the most common data manipulation tasks that analysts need to do. One of my first posts on this blog was on how to do this in R.

That post highlighted the spread() and gather() functions available from the tidyr package. As it turns out, the gather() and spread() functions are still available to use but are no longer under active development.

Two newer functions available in tidyr are pivot_longer() and pivot_wider(). These functions “lengthen” and “widen” a data set. According to the function documentation, these two functions are designed to be easier to use and can handle more use cases.

To examine how the functions work, say you had an experiment with the number of tree seedlings. These seedlings were initially measured in eight different plots. Five years later the same seedlings were measured again.

The veg data set is formatted long and contains the “pre” (initial) and “post”" (five years later) measurements, along with the number of seedlings per acre:

Table 1: The seedling dataset.
PlotIDPeriodSeedlings
1Pre1200
1Post800
2Pre1250
2Post950
3Pre1350
3Post1200
4Pre1200
4Post650
5Pre1100
5Post950
6Pre1350
6Post900
7Pre1200
7Post650
8Pre1240
8Post910

So what happened during the five years? The number of tree seedlings generally decreased. Trees grew larger into the sapling class and many seedlings suffered mortality. Here is a violin plot that shows the trend:

Now, how to visualize the change in tree seedlings between the two measurements? We can convert the veg data to a wide format using the pivot_wider() function from the tidyr package. The result will turn the 16 rows of data into eight rows of data. In this new data set called veg_wide, seedlings will be stored in two columns: the pre and post measurements:

veg_wide <- veg %>% 
  pivot_wider(names_from = Period, values_from = Seedlings)

In the names_from argument, you state which column(s) you want to provide in the new output column(s). In our example we create two new columns called Pre and Post that correspond to each of the seedling measurements. In the values_from statement, you state which column(s) to input the cell values into. In our example, are can compare the Pre and Post measurements now that they’re side-by-side:

Table 2: The seedling dataset in a WIDE format.
PlotIDPrePost
11200800
21250950
313501200
41200650
51100950
61350900
71200650
81240910

Now we can easily calculate a new variable that quantifies the change in the number of seedlings, called delta_Seedlings. We’ll use the mutate() function:

veg_wide <- veg_wide %>% 
  mutate(delta_Seedlings = Post - Pre)

Now we can visualize the primary variable we’re interested in:

The data are now presented in a wide format. Now, we can the pivot_longer() function to make this in a new data frame called veg_long.

veg_long <- veg_wide %>% 
  pivot_longer(-PlotID, names_to = "Period", values_to = "Seedlings") %>% 
  filter(Period != "delta_Seedlings") 

The - operator in the first argument indicates which column(s) you want to pivot (i.e., PlotID). The names_to argument names the new column based on the “wide” column names. We will name this new column Period which will collapse the Pre and Post variables. The values_to argument provides the name of the column to create that was stored in each of the cell values (i.e., the seedling measurements).

Note that the veg_long data set pivots all of the variables in a data set—in our example we suppressed the output of the delta_Seedlings variable using the filter() function from the dplyr package.

Here is how that data set looks:

Table 3: The seedling dataset in a LONG format.
PlotIDPeriodSeedlings
1Pre1200
1Post800
2Pre1250
2Post950
3Pre1350
3Post1200
4Pre1200
4Post650
5Pre1100
5Post950
6Pre1350
6Post900
7Pre1200
7Post650
8Pre1240
8Post910

Look familiar? The veg_long and original veg data sets are identical. Try out the pivot_wider() and pivot_longer() functions the next time you’re reshaping data. They are two of many functions in R for organizing tidy data.

By Matt Russell. Sign up for my monthly newsletter for in-depth analysis on data and analytics in the forest products industry.

Preliminary thoughts on using the new National Scale Volume and Biomass Estimators

February 26, 2024
forest inventory and analysis forest carbon forest inventory forest measurements NSVB FIA analytics

A list of R packages for forestry applications

November 24, 2023
analytics R R packages statistics data science forestry

How much does adding previous diameter and height growth change FVS predictions?

October 31, 2023
analytics carbon forest carbon forest inventory forest inventory and analysis forest measurements FVS growth and yield