Four ways to supercharge your scatterplots

February 3, 2020
analytics Communicating data Data viz ggplot2 scatterplots

Scatterplots are a go-to visualization with forestry data. They show the relationship between two quantitative variables measured on the same object, such as a tree’s diameter and height.

There are a few rules for all scatterplots:

Without any statistics or deep analytics, scatterplots can reveal a linear or nonlinear pattern, show the direction and strength of a relationship, and expose any outliers in the data.

Two variables are positively associated with one another when above-average values of one are observed with above-average values of the other. Two variables are negatively associated when above-average values of one are found with below-average values of the other.

The following example shows the diameter at breast height (DBH; in inches) and height (HT; in feet) of four species of trees growing in Cloquet, Minnesota:

Table 1: A subset of the Cloquet tree data.
DBH (in)  HT (ft)  Species
1.5       10       Balsam fir
1.3       10       Quaking aspen
2.3       22       Red pine
6.0       44       Red pine
8.3       61       Quaking aspen
6.9       62       Red pine
10.7      90       Quaking aspen
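
To make the idea of positive association concrete, the seven trees in Table 1 can be typed in by hand and checked numerically. This is only a sketch on the small subset, not the full 1,034-tree data set, and the names tree_subset, above_dbh, above_ht, and r are chosen here for illustration:

```r
# Seven trees copied from Table 1 (a small subset, not the full Cloquet data)
tree_subset <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),
  HT = c(10, 10, 22, 44, 61, 62, 90),
  Species = c("Balsam fir", "Quaking aspen", "Red pine", "Red pine",
              "Quaking aspen", "Red pine", "Quaking aspen")
)

# Flag the trees that are above average in each variable
above_dbh <- tree_subset$DBH > mean(tree_subset$DBH)
above_ht <- tree_subset$HT > mean(tree_subset$HT)

# Cross-tabulate the flags: a positive association puts most trees
# in the FALSE/FALSE and TRUE/TRUE cells
table(above_dbh, above_ht)

# Pearson correlation: a value above zero indicates a positive association
r <- cor(tree_subset$DBH, tree_subset$HT)
r
```

In this subset, every tree with an above-average diameter also has an above-average height, which is what a positive association looks like in the raw numbers.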

A scatterplot can be drawn with the 1,034 tree observations of DBH-HT pairs for the four most common species in the data. The ggplot() function along with geom_point() is a great way to start using the tidyverse suite of functions in R:

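# Load ggplot2 (part of the tidyverse)
library(ggplot2)

# Create the original scatterplot
# (assumes the Cloquet data are already loaded in a data frame named tree)
ggplot(tree, aes(DBH, HT)) +
  geom_point() +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))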

In many situations, this is where the visualization ends. However, a few variations on the basic plot can reveal much more. This post describes four tips to “supercharge” your scatterplots using ggplot() in R to get more insights from your data.

Four ways to supercharge your scatterplots

1. Add color.

Best for: Showing the range of values within different levels of a categorical variable.

Adding color is an effective way to reveal more of what’s behind the data. It works well when the data contain a categorical variable, the number of levels in that variable is not too large (say, fewer than six), and the distributions may differ between the levels of that variable.

As an example, we can add tree species as a mapping variable in the aes() statement to add color to our scatterplot. We see that red pine trees are some of the tallest and have the largest diameters:

# Create scatterplot with color showing species
ggplot(tree, aes(DBH, HT, col = Species)) +
  geom_point() +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)")+
  theme(panel.background = element_rect(fill = "NA"),
        axis.line = element_line(color = "black"))

2. Add trend lines.
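
A trend line summarizes the overall DBH-HT relationship on top of the points. The following is a minimal sketch, assuming a straight-line fit with geom_smooth(method = "lm") is acceptable for illustration. It uses only the seven trees from Table 1, and the names tree_subset, fit, and p_trend are chosen here; a nonlinear smoother may suit the full data set better.

```r
library(ggplot2)

# Seven trees copied from Table 1 (a small subset, not the full Cloquet data)
tree_subset <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),
  HT = c(10, 10, 22, 44, 61, 62, 90)
)

# Fit the same straight line numerically to inspect intercept and slope
fit <- lm(HT ~ DBH, data = tree_subset)
coef(fit)

# Scatterplot with a least-squares line and no confidence band
p_trend <- ggplot(tree_subset, aes(DBH, HT)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))

p_trend
```

Adding col = Species inside aes() would typically give one fitted line per species, since geom_smooth() fits within each color group.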

3. Facet it.

Best for: Splitting up the scatterplot to show the range and number of observations within different levels of a categorical variable.

Another way to easily see the differences in ranges of two continuous variables in a scatterplot is to plot each level of a categorical variable in its own panel. The facet_wrap() function allows you to do this.

In this case we easily see that red pine trees have a full range of DBH-HT, while the other species have a narrower range:

# Create scatterplot faceted by species
ggplot(tree, aes(DBH, HT, col = Species)) +
  geom_point() +
  facet_wrap(~Species)+
  labs(x = "Tree diameter (in)", y = "Tree height (ft)")+
  theme(panel.background = element_rect(fill = "NA"),
        axis.line = element_line(color="black"))

The facet_wrap() function works well when you have a single categorical variable to facet. The facet_grid() function allows you to plot two categorical variables simultaneously.

We don’t have a great example with the Cloquet tree data set that could serve as a second categorical variable. But you could imagine that if we had a tree’s crown class, we could plot the four species vertically and four crown classes horizontally (dominant, co-dominant, intermediate, and suppressed).
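
As a rough sketch of that idea, the code below attaches a made-up crown class to the seven trees from Table 1 and passes both variables to facet_grid(). The CrownClass values are hypothetical and invented purely for illustration (the Cloquet data have no such column), and the names tree_demo and p_grid are chosen here:

```r
library(ggplot2)

# Seven trees copied from Table 1, plus a HYPOTHETICAL crown class column
tree_demo <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),
  HT = c(10, 10, 22, 44, 61, 62, 90),
  Species = c("Balsam fir", "Quaking aspen", "Red pine", "Red pine",
              "Quaking aspen", "Red pine", "Quaking aspen"),
  # Invented values for this sketch only; not measured data
  CrownClass = c("Suppressed", "Suppressed", "Intermediate", "Co-dominant",
                 "Dominant", "Co-dominant", "Dominant")
)

# Species define the rows (stacked vertically);
# crown classes define the columns (laid out horizontally)
p_grid <- ggplot(tree_demo, aes(DBH, HT)) +
  geom_point() +
  facet_grid(Species ~ CrownClass) +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))

p_grid
```

The formula in facet_grid() reads as rows ~ columns, so swapping the two variables would transpose the grid of panels.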

4. Hex it.

Best for: Showing the number of observations in a “busy” area of the scatterplot.

A “hexagonal heat map” can be produced in ggplot that divides the plotting area into hexagons, with the color of each hexagon reflecting the number of observations that fall inside it. The geom_hex() function does the binning and counting.

Here’s an example with the number of bins along the x- and y-axes set to 25:

# Create hexagonal heat map (geom_hex() requires the hexbin package)
ggplot(tree, aes(DBH, HT)) +
  geom_hex(bins = 25) +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)")+
  theme(panel.background = element_rect(fill = "NA"),
        axis.line = element_line(color = "black"))

The hexagonal scatterplot shows that most of the trees in the Cloquet data set are less than 12 inches in diameter and shorter than 60 feet. In the original scatterplot, overlapping points in “busy” areas of the graph hide this pattern. Knowing that this “clustering” exists can be insightful for future data analysis.

The number of bins in geom_hex() can be increased to see a finer resolution (with fewer observations grouped into each hexagon), or decreased to see a coarser resolution (with more observations grouped into each hexagon).
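
One way to compare resolutions is to wrap the plot in a small helper and call it with different bin counts. This is a sketch only: hex_plot() is a hypothetical helper name, the seven Table 1 trees stand in so the code runs on its own, and drawing the plots assumes the hexbin package is installed. The comparison is only really informative with the full data set.

```r
library(ggplot2)

# Hypothetical helper: hexagonal heat map of DBH vs. HT for a given bin count
hex_plot <- function(data, bins) {
  ggplot(data, aes(DBH, HT)) +
    geom_hex(bins = bins) +
    labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
    theme(panel.background = element_rect(fill = NA),
          axis.line = element_line(color = "black"))
}

# Seven trees copied from Table 1, used only so the sketch is self-contained
tree_subset <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),
  HT = c(10, 10, 22, 44, 61, 62, 90)
)

# Coarser resolution: fewer, larger hexagons
p_coarse <- hex_plot(tree_subset, bins = 10)

# Finer resolution: more, smaller hexagons
p_fine <- hex_plot(tree_subset, bins = 50)
```

With the full data loaded as tree, calls such as hex_plot(tree, bins = 10) and hex_plot(tree, bins = 50) would show the same trade-off on all of the observations.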

Conclusion

Scatterplots are some of the first visualizations we make when we begin to analyze data. With a few slight modifications and additions to our ggplot code, we can draw out information that can’t be seen in a basic scatterplot. Adding color, fitting trend lines, faceting, and creating hexagonal heat maps can supercharge your scatterplot so you gain more insight from your data.

By Matt Russell.
