Four ways to supercharge your scatterplots
February 3, 2020
Scatterplots are a go-to visualization with forestry data. They show the relationship between two quantitative variables measured on the same object, for example, a tree’s diameter and height.
There are a few rules for all scatterplots:
- The explanatory variable goes on the x-axis. (This is usually the “easier” measurement to collect.)
- The response variable goes on the y-axis. (This is usually the “harder” measurement to collect.)
- Each observation is a point on the graph.
Without any statistics or deep analytics, scatterplots can reveal a linear or nonlinear pattern, show the direction and strength of a relationship, and expose any outliers in the data.
Two variables are positively associated with one another when above-average values of one are observed with above-average values of the other. Two variables are negatively associated when above-average values of one are found with below-average values of the other.
The following example shows the diameter at breast height (DBH; in inches) and height (HT; in feet) of four species of trees growing in Cloquet, Minnesota:
DBH | HT | Species |
---|---|---|
1.5 | 10 | Balsam fir |
1.3 | 10 | Quaking aspen |
2.3 | 22 | Red pine |
6.0 | 44 | Red pine |
8.3 | 61 | Quaking aspen |
6.9 | 62 | Red pine |
10.7 | 90 | Quaking aspen |
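To experiment without the full Cloquet data set, the seven sample rows above can be typed into a small data frame. This is only a sketch: the name tree_sample is an assumption (the code in this post uses a data frame called tree), and seven rows are far too few to draw conclusions from. The correlation at the end is a quick numeric check of the positive association described earlier.

```r
# Sample rows copied from the table above (not the full 1,034-tree data set).
# `tree_sample` is a hypothetical name chosen for this sketch.
tree_sample <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),  # diameter at breast height (in)
  HT = c(10, 10, 22, 44, 61, 62, 90),           # height (ft)
  Species = c("Balsam fir", "Quaking aspen", "Red pine", "Red pine",
              "Quaking aspen", "Red pine", "Quaking aspen")
)

# A positive correlation means above-average diameters tend to occur
# with above-average heights.
dbh_ht_cor <- cor(tree_sample$DBH, tree_sample$HT)
print(dbh_ht_cor)
```

A value close to 1 indicates a strong positive linear association, while a value close to -1 would indicate a strong negative one.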
A scatterplot can be drawn with the 1,034 tree observations of DBH-HT pairs for the four most common species in the data. The ggplot() function along with geom_point() is a great way to start using the tidyverse suite of functions in R:
# Load ggplot2 (part of the tidyverse)
library(ggplot2)
# Create the original scatterplot
ggplot(tree, aes(DBH, HT)) +
  geom_point() +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  # Transparent panel background with black axis lines
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))
In many situations, this is where the visualization ends. However, a few variations can help a scatterplot do more. This post describes four tips to “supercharge” your scatterplots using ggplot() in R to get more insights from your data.
Four ways to supercharge your scatterplots
1. Add color.
Best for: Showing the range of values within different levels of a categorical variable.
Adding color is an effective way to reveal more of what’s behind the data. It works well when the data contain a categorical variable, the number of levels in that variable is not too large (say, fewer than six), and the distributions may differ between those levels.
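One way to check whether a categorical variable is a good candidate for color is to count its levels and the observations in each. The sketch below does this for the species in the sample rows shown earlier in this post. The vector name species and the cutoff of six are assumptions taken from the rule of thumb above, not fixed limits.

```r
# Species column from the seven sample rows shown earlier in the post.
# `species` is a hypothetical name used only in this sketch.
species <- c("Balsam fir", "Quaking aspen", "Red pine", "Red pine",
             "Quaking aspen", "Red pine", "Quaking aspen")

# Number of distinct levels and the number of observations in each
n_levels <- length(unique(species))
obs_per_species <- table(species)
print(obs_per_species)

# Rule of thumb from the text: fewer than six levels reads well in color
ok_for_color <- n_levels < 6
print(ok_for_color)
```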
As an example, we can add tree species as a mapping variable in the aes() statement to add color to our scatterplot. We see that red pine trees are some of the tallest and have the largest diameters:
# Create scatterplot with color showing species
ggplot(tree, aes(DBH, HT, col = Species)) +
  geom_point() +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))
2. Add trend lines.
Best for: Revealing trends between two continuous variables.
Adding a trend line can easily reveal a relationship between two continuous variables. This is helpful if you need to make a quick approximation between two variables. For example, after fitting a trend line you could say “A 20-inch diameter tree will be approximately 90 feet tall”.
Adding a trend line can also reveal whether the relationship in the data is linear or nonlinear. The geom_smooth() function by default fits a smoothed conditional mean to the data, along with confidence intervals surrounding the estimate. Other trend lines, such as those from linear regressions (geom_smooth(method = lm)), can fit linear trends:
# Create scatterplot with trend line
ggplot(tree, aes(DBH, HT)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))
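The “quick approximation” idea can also be done numerically. The sketch below fits a straight line with lm() to the seven sample rows shown earlier in this post and predicts height at a chosen diameter. It is an illustration only: tree_sample is a hypothetical name, the straight line is an assumption rather than the smoothed fit drawn by geom_smooth(), and a fit to seven trees will not reproduce the trend in the full data set.

```r
# Seven sample rows from the table earlier in the post (hypothetical name).
tree_sample <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),  # diameter (in)
  HT = c(10, 10, 22, 44, 61, 62, 90)            # height (ft)
)

# Fit a simple linear regression of height on diameter
fit <- lm(HT ~ DBH, data = tree_sample)
slope <- coef(fit)[["DBH"]]  # estimated feet of height per inch of diameter

# Predict height for an 8-inch tree, which lies inside the sampled range
pred_ht <- unname(predict(fit, newdata = data.frame(DBH = 8)))
print(pred_ht)
```

The prediction is made at 8 inches rather than 20 because the sample rows stop at 10.7 inches, and predicting far outside the observed range is extrapolation.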
You can combine tips #1 and #2 to fit a trend line for each species. In ggplot(), this is effective because each trend line spans only the range of that species’ data. This is another visualization that makes it easy to see the different minimum and maximum values within a data set:
# Create scatterplot with trend line for each species
ggplot(tree, aes(DBH, HT, col = Species)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))
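The same minimum and maximum values can be tabulated directly. This sketch uses the seven sample rows shown earlier in this post (tree_sample is a hypothetical name) and base R to report the diameter range for each species. With so few rows the numbers only illustrate the approach.

```r
# Seven sample rows from the table earlier in the post (hypothetical name).
tree_sample <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),  # diameter (in)
  Species = c("Balsam fir", "Quaking aspen", "Red pine", "Red pine",
              "Quaking aspen", "Red pine", "Quaking aspen")
)

# Split diameters by species, then take the range (min, max) of each group.
# The result is a matrix with one column per species.
dbh_range <- sapply(split(tree_sample$DBH, tree_sample$Species), range)
rownames(dbh_range) <- c("min", "max")
print(dbh_range)
```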
3. Facet it.
Best for: Splitting up the scatterplot to show the range and number of observations within different levels of a categorical variable.
Another way to easily see the differences in ranges of two continuous variables in a scatterplot is to plot each level of a categorical variable in its own panel. The facet_wrap() function allows you to do this.
In this case we easily see that red pine trees have a full range of DBH-HT, while the other species have a narrower range:
# Create faceted scatterplot with one panel per species
ggplot(tree, aes(DBH, HT, col = Species)) +
  geom_point() +
  facet_wrap(~Species) +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))
The facet_wrap() function works well when you have a single categorical variable to facet. The facet_grid() function allows you to plot two categorical variables simultaneously.
We don’t have a great example with the Cloquet tree data set that could serve as a second categorical variable. But you could imagine that if we had a tree’s crown class, we could plot the four species vertically and four crown classes horizontally (dominant, co-dominant, intermediate, and suppressed).
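A rough sketch of what that could look like is below. Everything about crown class here is hypothetical: the Cloquet data have no such variable, so the CrownClass column is invented purely to show the facet_grid() syntax and carries no real information. The data frame name tree_sample is also an assumption, and the other columns are the seven sample rows shown earlier in this post.

```r
library(ggplot2)

# Seven sample rows from the post, plus an INVENTED CrownClass column.
# The crown classes are made up for illustration and are not real data.
tree_sample <- data.frame(
  DBH = c(1.5, 1.3, 2.3, 6.0, 8.3, 6.9, 10.7),
  HT = c(10, 10, 22, 44, 61, 62, 90),
  Species = c("Balsam fir", "Quaking aspen", "Red pine", "Red pine",
              "Quaking aspen", "Red pine", "Quaking aspen"),
  CrownClass = c("Suppressed", "Suppressed", "Intermediate", "Co-dominant",
                 "Co-dominant", "Dominant", "Dominant")
)

# Rows of panels = species, columns of panels = crown class
p <- ggplot(tree_sample, aes(DBH, HT)) +
  geom_point() +
  facet_grid(Species ~ CrownClass) +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)")

# print(p) draws the grid of panels
```

In the formula passed to facet_grid(), the variable on the left of the tilde defines the rows of panels and the variable on the right defines the columns.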
4. Hex it.
Best for: Showing the number of observations in a “busy” area of the scatterplot.
A “hexagonal heat map” can be produced in ggplot that divides the plotting area into hexagons, where the color of each hexagon reflects the number of observations that fall within it. The geom_hex() function counts the observations in each hexagon and fills it accordingly.
Here’s an example with the number of bins along both the x- and y-axes set to 25:
# Create hexagonal heat map of tree diameter and height
ggplot(tree, aes(DBH, HT)) +
  geom_hex(bins = 25) +
  labs(x = "Tree diameter (in)", y = "Tree height (ft)") +
  theme(panel.background = element_rect(fill = NA),
        axis.line = element_line(color = "black"))
The hexagonal scatterplot shows that most of the observations in the Cloquet tree data set are less than 12 inches in diameter and are shorter than 60 feet tall. In the original scatterplot, due to overlapping points in “busy” areas of the graph, this finding can’t really be observed. Knowing that this “clustering” exists can be insightful for future data analysis.
The number of bins in geom_hex() can be increased to see a finer resolution (with fewer observations grouped into each hexagon). Or it can be decreased to see a coarser resolution (with more observations grouped into each hexagon).
Conclusion
Scatterplots are some of the first visualizations we make when we begin to analyze data. With a few slight modifications and additions to our ggplot code, we can draw more information from the data that can’t be seen in a traditional scatterplot. Adding color, fitting trend lines, faceting, and creating hexagonal heat maps can supercharge your scatterplot so you gain more insight from your data.
By Matt Russell.