A student emailed this week with a data management question: how do you reshape thousands of rows of data so that you can make a calculation based on two time periods? In short, the data were in a long format and needed to be wide.
The experiment was something like this: the student had data containing the number of tree seedlings in a forest, measured in eight different plots. She revisited the same forest five years later and measured the number of seedlings again.
The veg dataset is in long format and records, for each plot, the “pre” (initial) and “post” (five years later) measurements of the number of seedlings per acre:
##    PlotID Period Seedlings
## 1       1    Pre      1200
## 2       1   Post       800
## 3       2    Pre      1250
## 4       2   Post       950
## 5       3    Pre      1350
## 6       3   Post      1200
## 7       4    Pre      1200
## 8       4   Post       650
## 9       5    Pre      1100
## 10      5   Post       950
## 11      6    Pre      1350
## 12      6   Post       900
## 13      7    Pre      1200
## 14      7   Post       650
## 15      8    Pre      1240
## 16      8   Post       910
So what happened during the five years? The number of tree seedlings generally decreased. Some seedlings grew into the sapling class, and many others died.
Now, how to visualize the change in tree seedlings between the two measurements? We can convert the veg data to a wide format using the spread() function from the tidyr package. This turns the 16 rows of data into eight rows, one per plot. In the new dataset, called veg_wide, seedlings are stored in two columns: the pre and post measurements.
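The call that produced veg_wide is not shown here, so the following is only a minimal sketch of how it might look. It assumes veg is rebuilt by hand from the printed values, that Period is a factor with levels Pre then Post (so the new columns come out in that order), and that the sep value is the one implied by the column names used in the rest of the post.

```r
library(tidyr)

# Rebuild the long-format data from the printed output (assumed structure).
veg <- data.frame(
  PlotID = rep(1:8, each = 2),
  # Assumption: Period is a factor ordered Pre, Post.
  Period = factor(rep(c("Pre", "Post"), times = 8), levels = c("Pre", "Post")),
  Seedlings = c(1200, 800, 1250, 950, 1350, 1200, 1200, 650,
                1100, 950, 1350, 900, 1200, 650, 1240, 910)
)

# Spread Period into columns; sep pastes the key name and key value together,
# giving Period_seedlings_Pre and Period_seedlings_Post.
veg_wide <- spread(veg, key = Period, value = Seedlings, sep = "_seedlings_")
veg_wide
```

With a character Period column instead of a factor, spread() would order the new columns alphabetically (Post before Pre).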
Now we can easily calculate a new variable, called delta_Seedlings, that quantifies the change in the number of seedlings:
veg_wide$delta_Seedlings <- veg_wide$Period_seedlings_Post - veg_wide$Period_seedlings_Pre
veg_wide
##   PlotID Period_seedlings_Pre Period_seedlings_Post delta_Seedlings
## 1      1                 1200                   800            -400
## 2      2                 1250                   950            -300
## 3      3                 1350                  1200            -150
## 4      4                 1200                   650            -550
## 5      5                 1100                   950            -150
## 6      6                 1350                   900            -450
## 7      7                 1200                   650            -550
## 8      8                 1240                   910            -330
Now we can visualize the primary variable we’re interested in:
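The original figure is not reproduced here. One possible way to draw it, sketched with ggplot2 (an assumption; the post does not say which plotting package was used), is a bar chart of delta_Seedlings by plot. The wide data are rebuilt by hand from the printed values so the block runs on its own.

```r
library(ggplot2)

# Wide-format data rebuilt from the printed output above.
veg_wide <- data.frame(
  PlotID = 1:8,
  Period_seedlings_Pre  = c(1200, 1250, 1350, 1200, 1100, 1350, 1200, 1240),
  Period_seedlings_Post = c(800, 950, 1200, 650, 950, 900, 650, 910)
)
veg_wide$delta_Seedlings <- veg_wide$Period_seedlings_Post -
  veg_wide$Period_seedlings_Pre

# One bar per plot; negative bars indicate a loss of seedlings.
p <- ggplot(veg_wide, aes(x = factor(PlotID), y = delta_Seedlings)) +
  geom_col() +
  labs(x = "Plot ID", y = "Change in seedlings per acre (post minus pre)")
print(p)
```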
The data are now in a wide format. We can also use the tidyr package to convert them back to a long format, in which each plot measurement has two variables: Period and Seedlings. We’ll use the gather() function to do this, storing the result in a new data frame called veg_long.
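library(dplyr)  # provides %>% and select()
library(tidyr)  # provides gather()
# Stack the two seedling columns into Period (column name) and Seedlings (value)
veg_long <- veg_wide %>%
  gather(key = Period, value = Seedlings,
         Period_seedlings_Pre, Period_seedlings_Post)
veg_long %>% select(PlotID, Period, Seedlings)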
##    PlotID                Period Seedlings
## 1       1  Period_seedlings_Pre      1200
## 2       2  Period_seedlings_Pre      1250
## 3       3  Period_seedlings_Pre      1350
## 4       4  Period_seedlings_Pre      1200
## 5       5  Period_seedlings_Pre      1100
## 6       6  Period_seedlings_Pre      1350
## 7       7  Period_seedlings_Pre      1200
## 8       8  Period_seedlings_Pre      1240
## 9       1 Period_seedlings_Post       800
## 10      2 Period_seedlings_Post       950
## 11      3 Period_seedlings_Post      1200
## 12      4 Period_seedlings_Post       650
## 13      5 Period_seedlings_Post       950
## 14      6 Period_seedlings_Post       900
## 15      7 Period_seedlings_Post       650
## 16      8 Period_seedlings_Post       910
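Look familiar? The veg_long and original veg datasets are nearly identical: they differ only in row order and in the Period labels, which now carry the Period_seedlings_ prefix. The spread() and gather() functions are two of many functions in R for organizing tidy data.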
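In tidyr 1.0.0 and later, spread() and gather() still work but are superseded by pivot_wider() and pivot_longer(). As a rough sketch of the same round trip with the newer functions (the data frame is rebuilt from the printed values, and the veg_wide2 and veg_long2 names are made up for this example):

```r
library(tidyr)

# Long-format data rebuilt from the printed output (assumed structure).
veg <- data.frame(
  PlotID = rep(1:8, each = 2),
  Period = factor(rep(c("Pre", "Post"), times = 8), levels = c("Pre", "Post")),
  Seedlings = c(1200, 800, 1250, 950, 1350, 1200, 1200, 650,
                1100, 950, 1350, 900, 1200, 650, 1240, 910)
)

# Long to wide: one row per plot, one column per period.
veg_wide2 <- pivot_wider(veg,
                         names_from = Period,
                         values_from = Seedlings,
                         names_prefix = "Period_seedlings_")

# Wide to long: names_prefix strips the prefix so Period reads Pre / Post again.
veg_long2 <- pivot_longer(veg_wide2,
                          cols = -PlotID,
                          names_to = "Period",
                          names_prefix = "Period_seedlings_",
                          values_to = "Seedlings")
veg_long2
```

Stripping the prefix on the way back means the Period labels match the original veg data, which gather() alone does not do.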
By Matt Russell. Leave a comment below or email Matt with any questions or comments.