Forget spreading and gathering your R data, try pivoting instead

August 8, 2020

analytics R tidyr pivot_long pivot_wide

Reshaping data from a wide to a long format, or vice versa, is one of the most common data manipulation tasks that analysts need to do. One of my first posts on this blog was on how to do this in R.

That post highlighted the spread() and gather() functions available from the tidyr package. As it turns out, the gather() and spread() functions are still available to use but are no longer under active development.

Two newer functions available in tidyr are pivot_longer() and pivot_wider(). These functions “lengthen” and “widen” a data set. According to the function documentation, these two functions are designed to be easier to use and can handle more use cases.

To examine how the functions work, say you had an experiment with the number of tree seedlings. These seedlings were initially measured in eight different plots. Five years later the same seedlings were measured again.

The veg data set is formatted long and contains the “pre” (initial) and “post”" (five years later) measurements, along with the number of seedlings per acre:

Table 1: The seedling dataset.
PlotID	Period	Seedlings
1	Pre	1200
1	Post	800
2	Pre	1250
2	Post	950
3	Pre	1350
3	Post	1200
4	Pre	1200
4	Post	650
5	Pre	1100
5	Post	950
6	Pre	1350
6	Post	900
7	Pre	1200
7	Post	650
8	Pre	1240
8	Post	910

So what happened during the five years? The number of tree seedlings generally decreased. Trees grew larger into the sapling class and many seedlings suffered mortality. Here is a violin plot that shows the trend:

Now, how to visualize the change in tree seedlings between the two measurements? We can convert the veg data to a wide format using the pivot_wider() function from the tidyr package. The result will turn the 16 rows of data into eight rows of data. In this new data set called veg_wide, seedlings will be stored in two columns: the pre and post measurements:

veg_wide <- veg %>% 
  pivot_wider(names_from = Period, values_from = Seedlings)

In the names_from argument, you state which column(s) you want to provide in the new output column(s). In our example we create two new columns called Pre and Post that correspond to each of the seedling measurements. In the values_from statement, you state which column(s) to input the cell values into. In our example, are can compare the Pre and Post measurements now that they’re side-by-side:

Table 2: The seedling dataset in a WIDE format.
PlotID	Pre	Post
1	1200	800
2	1250	950
3	1350	1200
4	1200	650
5	1100	950
6	1350	900
7	1200	650
8	1240	910

Now we can easily calculate a new variable that quantifies the change in the number of seedlings, called delta_Seedlings. We’ll use the mutate() function:

veg_wide <- veg_wide %>% 
  mutate(delta_Seedlings = Post - Pre)

Now we can visualize the primary variable we’re interested in:

The data are now presented in a wide format. Now, we can the pivot_longer() function to make this in a new data frame called veg_long.

veg_long <- veg_wide %>% 
  pivot_longer(-PlotID, names_to = "Period", values_to = "Seedlings") %>% 
  filter(Period != "delta_Seedlings")

The - operator in the first argument indicates which column(s) you want to pivot (i.e., PlotID). The names_to argument names the new column based on the “wide” column names. We will name this new column Period which will collapse the Pre and Post variables. The values_to argument provides the name of the column to create that was stored in each of the cell values (i.e., the seedling measurements).

Note that the veg_long data set pivots all of the variables in a data set—in our example we suppressed the output of the delta_Seedlings variable using the filter() function from the dplyr package.

Here is how that data set looks:

Table 3: The seedling dataset in a LONG format.
PlotID	Period	Seedlings
1	Pre	1200
1	Post	800
2	Pre	1250
2	Post	950
3	Pre	1350
3	Post	1200
4	Pre	1200
4	Post	650
5	Pre	1100
5	Post	950
6	Pre	1350
6	Post	900
7	Pre	1200
7	Post	650
8	Pre	1240
8	Post	910

Look familiar? The veg_long and original veg data sets are identical. Try out the pivot_wider() and pivot_longer() functions the next time you’re reshaping data. They are two of many functions in R for organizing tidy data.

–

By Matt Russell. Sign up for my monthly newsletter for in-depth analysis on data and analytics in the forest products industry.