Foundations of {ggplot2}

Eric R. Scott

2024-06-06

Learning Objectives

  • Understand the “grammar of graphics” and how it inspired ggplot2’s design
  • Know what it means to “map data to aesthetics”
  • Learn to plot different data sets on the same visualization
  • Understand the relationship between a “geom” and a “stat”
  • Learn to customize scales and guides

A Grammar of Graphics

  • A book by Wilkinson (2005)
  • Inspired the development of many graphics programs including ggplot2 (Wickham 2010)
  • Alternative to having a function for every kind of plot
  • A framework for layering elements to create any kind of plot

Grammar of Graphics Components

  1. Data
  2. Aesthetics
  3. Scales (and guides)
  4. Geometric objects
  5. Statistics
  6. Facets
  7. Coordinate system

Data

Observations and variables to be visualized.

What data is being visualized?

  • Body mass measured in grams

  • Bill length measured in mm

  • Island and species categories

Aesthetics

Visual elements (color, shape, position, size, etc.) used to encode data.

What aesthetics are used to encode these data?

  • Color

  • x position

  • y position

Scales (and guides)

Scales translate data units into visual units; guides translate visual units back into data units.

What scales are used?

  • Continuous and linear x and y axes (scales)
  • Discrete (categorical) color scale
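
To make the scale idea concrete, the sketch below builds a scatter plot and inspects the values ggplot2 computed for the point layer. It assumes the palmerpenguins data used later in this deck and that species is the variable mapped to color; the object names `scatter` and `built` are only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Assumed reconstruction of the example plot: species mapped to color
scatter <- ggplot(
  penguins,
  aes(x = body_mass_g, y = bill_length_mm, color = species)
) +
  geom_point(na.rm = TRUE)

# layer_data() returns a layer's data after the plot has been built,
# i.e. after the non-position scales have been applied
built <- layer_data(scatter, 1)

# The discrete color scale has replaced each species name with a color string
n_colors <- length(unique(built$colour))
n_colors
table(built$colour)
```

The legend (a guide) does the reverse job for the reader: it shows which species each color stands for.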

Geometric Objects (“geoms”)

Objects, often having multiple aesthetics, that represent data visually.

What geometric objects are used?

  • Data is represented using circles/points
  • Trend is represented as a line

Statistics (“stats”)

Any calculations or transformations applied to the data in order to plot it.

What “stats” are used?

  • For points, none (stat = "identity")

  • For trend lines, linear regression

Facets

Plots can be split into small multiples or “facets” by a variable.

What is the faceting variable?

  • Faceted by island

Coordinate System

How spatial positions are represented on paper (or on screen), e.g. map projections.

What coordinate system is used?

  • Cartesian coordinates
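
As a rough sketch of how the seven components fit together in code, the call below spells out each one explicitly, including pieces that ggplot2 would otherwise fill in with defaults. The choice of variables (body mass on x, bill length on y, species as color, island as facet) is an assumption based on the answers above, and `components_plot` is a made-up name.

```r
library(ggplot2)
library(palmerpenguins)

components_plot <-
  ggplot(
    data = penguins,                                     # 1. data
    mapping = aes(x = body_mass_g, y = bill_length_mm)   # 2. aesthetics (position)
  ) +
  geom_point(                                            # 4. geom: points
    aes(color = species),                                # 2. aesthetics (color)
    stat = "identity",                                   # 5. stat: none
    na.rm = TRUE
  ) +
  geom_smooth(                                           # 4. geom: trend line
    method = "lm", formula = y ~ x,                      # 5. stat: linear regression
    na.rm = TRUE
  ) +
  scale_x_continuous() +                                 # 3. scales (defaults,
  scale_y_continuous() +                                 #    written out here
  scale_color_discrete() +                               #    for illustration)
  facet_wrap(vars(island)) +                             # 6. facets
  coord_cartesian()                                      # 7. coordinate system

components_plot
```

Dropping the three `scale_*()` lines, `stat = "identity"`, and `coord_cartesian()` should give the same plot, since those are the defaults for these layers.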

Practice

Identify each of the seven components of this plot

  1. Data
  2. Aesthetics
  3. Scales
  4. Geometric Objects
  5. Statistics
  6. Facets
  7. Coordinate System

Break ⏰


ggplot2 and the Grammar of Graphics

Data

  • Data is inherited from ggplot() by all layers, but can be overridden for specific layers

  • Worked example: jitter plot of raw data with mean ± standard deviation

library(tidyverse)
library(palmerpenguins)

#summarize dataset
peng_summary <- 
  penguins |> 
  group_by(island) |> 
  summarize(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    lower_sd = mean_mass - sd(body_mass_g, na.rm = TRUE),
    upper_sd = mean_mass + sd(body_mass_g, na.rm = TRUE)
  )

ggplot(peng_summary, aes(x = island, y = mean_mass)) +
  # mean: uses the data and x/y mappings inherited from ggplot()
  geom_point(shape = "square", color = "blue", size = 2.5) +
  # sd: also inherits peng_summary and x, so only ymin and ymax are added here
  geom_errorbar(
    aes(ymin = lower_sd, ymax = upper_sd),
    width = 0.1,
    color = "blue"
  ) +
  # raw data: override the inherited data and y mapping for this layer only
  geom_jitter(
    data = penguins,
    aes(y = body_mass_g),
    alpha = 0.4,
    height = 0
  )

Aesthetics

  • Aesthetics are inherited when placed in ggplot() but can also be specified per layer
  • Aesthetics can be mapped to data or set as constant
  • All the aesthetics and their possible values: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
  • Worked example: box plot for each island layered on top of jitter plot for each island × sex combination.

Aesthetic mappings supplied to ggplot() are inherited by every layer; aesthetic mappings supplied to a geom affect only that geom.

ggplot(penguins, aes(x = island, y = body_mass_g)) +
  geom_boxplot() +
  geom_jitter(aes(color = sex))
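
The difference between mapping an aesthetic to data and setting it to a constant comes down to whether the argument sits inside aes(). A minimal sketch, with made-up object names `base`, `mapped`, and `set_constant`:

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = island, y = body_mass_g))

# Mapped: inside aes(), so color varies with the data and a legend is drawn
mapped <- base +
  geom_jitter(aes(color = sex), height = 0, na.rm = TRUE)

# Set: outside aes(), so every point gets the same color and there is no legend
set_constant <- base +
  geom_jitter(color = "blue", height = 0, na.rm = TRUE)

mapped
set_constant
```

A common slip is writing aes(color = "blue"), which maps color to a one-level variable whose only value is the string "blue"; the points then get the first color of the default palette plus a legend, rather than blue points.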

Scales

  • scale_*() functions control the range (limits) and breaks of scales, and the labels and appearance of the corresponding guides.
  • Worked example: 1) re-order and re-color a discrete color scale, 2) change the number of breaks on the axes
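
The code in this section and in the Coords section adds to a plot object `p` that is not defined in this text. A plausible reconstruction, assuming it is the scatter plot described in the first half of the deck (the exact layers and arguments are a guess):

```r
library(ggplot2)
library(palmerpenguins)

# Assumed definition of `p`: bill length vs. body mass, colored by species,
# with linear trend lines, faceted by island
p <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(vars(island))
```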

p + 
  scale_color_manual(
    values = c(
      "Adelie" = "#7570b3",
      "Chinstrap" = "#d95f02",
      "Gentoo" = "#1b9e77"
    ),
    breaks = c("Gentoo", "Chinstrap", "Adelie")
  ) +
  scale_x_continuous(
    n.breaks = 10
  ) +
  scale_y_continuous(
    breaks = seq(from = 30, to = 65, by = 5.5)
  )

Geoms

  • Every geom has a default stat, but it can be overridden
  • Not all geoms use the same aesthetics
  • Worked example: explore the anatomy of a help file (e.g. ?geom_point())
    • How can you determine the default stat for a geom?
    • How can you find out what aesthetics a geom has?
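
Besides reading the help file, the same two questions can be answered from the console. This is a small sketch that pokes at ggplot2's layer and ggproto objects; `stat_name()` is a hypothetical helper defined here, not a ggplot2 function.

```r
library(ggplot2)

# A geom_*() call returns a layer object whose `stat` element is the stat in use
stat_name <- function(layer) class(layer$stat)[1]

stat_name(geom_point())      # StatIdentity: the data are plotted as-is
stat_name(geom_histogram())  # StatBin: the data are binned and counted first
stat_name(geom_bar())        # StatCount: rows are counted per x value

# Overriding the default stat: bar heights taken directly from a y variable
stat_name(geom_bar(stat = "identity"))  # StatIdentity

# Aesthetics a geom understands, and the ones it requires
point_aes <- GeomPoint$aesthetics()
point_required <- GeomPoint$required_aes
point_aes
point_required
```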

df <- expand_grid(x = LETTERS[1:5], y = 1:5)
ggplot(df) +
  geom_point(aes(x = x, y = y, 
        color = x, shape = x,
        size = y, alpha = y, stroke = y))

Caution

With great power comes great responsibility! It’s not always a good idea to map data to aesthetics just because you can. Stay tuned for part 2 of this series for more!

Stats

  • Every stat has a default geom, but in practice you usually call the geom_*() function rather than the stat_*() function
  • Some stats calculate multiple values that are available with after_stat()
  • Worked examples:
    1. Adding mean ± SD on top of jitter plot with stat_summary()
    2. Making a binned density plot with geom_histogram() and after_stat()

stat_summary()

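ggplot(penguins, aes(x = island, y = body_mass_g)) +
  geom_jitter(alpha = 0.4, height = 0) +
  # mean_sdl() wraps Hmisc::smean.sdl(), so the Hmisc package must be installed;
  # mult = 1 gives mean ± 1 SD (the default, mult = 2, gives mean ± 2 SD)
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               color = "blue", shape = "square")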
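
mean_sdl() relies on the Hmisc package. If that is not installed, one possible alternative is to pass the summary functions to stat_summary() separately through `fun`, `fun.min`, and `fun.max`. The sketch below does that and then pulls out the values the stat computed; `sd_plot` and `summary_values` are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

sd_plot <- ggplot(penguins, aes(x = island, y = body_mass_g)) +
  geom_jitter(alpha = 0.4, height = 0, na.rm = TRUE) +
  stat_summary(
    fun = mean,                             # point: group mean
    fun.min = function(y) mean(y) - sd(y),  # lower end: mean - 1 SD
    fun.max = function(y) mean(y) + sd(y),  # upper end: mean + 1 SD
    color = "blue", shape = "square",
    na.rm = TRUE                            # drop missing values silently
  )

sd_plot

# The values stat_summary() computed for each island (layer 2 of the plot)
summary_values <- layer_data(sd_plot, 2)[, c("x", "y", "ymin", "ymax")]
summary_values
```

These values should match the mean_mass, lower_sd, and upper_sd columns of the peng_summary table built by hand in the Data section.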

Binned density plot with geom_histogram() and after_stat()

ggplot(penguins) +
  geom_histogram(aes(x = body_mass_g, y = after_stat(density))) +
  facet_wrap(vars(island))
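
To see what after_stat(density) refers to, it can help to look at the table the stat produces. The sketch below rebuilds the histogram with an explicit binwidth (250 g is an arbitrary choice made here, not taken from the slides) and checks that the bar areas in each facet add up to 1, which is what makes it a density rather than a count.

```r
library(ggplot2)
library(palmerpenguins)

hist_plot <- ggplot(penguins) +
  geom_histogram(
    aes(x = body_mass_g, y = after_stat(density)),
    binwidth = 250,  # arbitrary bin width chosen for this sketch
    na.rm = TRUE
  ) +
  facet_wrap(vars(island))

# One row per bar; stat_bin() computes several variables for each bin
computed <- layer_data(hist_plot, 1)
intersect(c("count", "density", "ncount", "ndensity"), names(computed))

# y holds after_stat(density); bar area = height * width, summed within each panel
bar_area <- computed$y * (computed$xmax - computed$xmin)
area_by_panel <- tapply(bar_area, computed$PANEL, sum)
area_by_panel
```

Any of the other computed variables can be used the same way, for example y = after_stat(count) for the default histogram.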

Facets

  • Implemented with facet_wrap() and facet_grid()
  • Facets will be explored more in part 2 of this series

Coords

  • Adjusting x and y limits with coord_cartesian() is different from adjusting limits in a scale
  • coord_polar() for polar data, coord_sf() for maps
  • Worked example: zooming in on data

Setting axis limits in scale_x_continuous() removes out-of-range data before stats are computed, so the trend lines are fit only to the remaining points

p + scale_x_continuous(limits = c(4000, 5000))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 222 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 222 rows containing missing values or values outside the scale range
(`geom_point()`).

Setting axis limits in coord_cartesian() simply zooms in; all of the data are still used to fit the trend lines

p + coord_cartesian(xlim = c(4000, 5000))
`geom_smooth()` using formula = 'y ~ x'
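
One way to check the difference numerically is to count how many points keep a usable x value under each approach. This is a sketch using a simplified stand-in for `p` (points only, no trend line or facets); `count_usable_x()` is a hypothetical helper.

```r
library(ggplot2)
library(palmerpenguins)

# Simplified stand-in for the plot used above
base_plot <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(na.rm = TRUE)

# Count the points that still have a finite x value once the plot is built
count_usable_x <- function(plot) {
  sum(is.finite(layer_data(plot, 1)$x))
}

# Scale limits: out-of-range values are turned into NA
n_scale_limits <- count_usable_x(
  base_plot + scale_x_continuous(limits = c(4000, 5000))
)

# Coord limits: the data are untouched, only the viewport changes
n_coord_zoom <- count_usable_x(
  base_plot + coord_cartesian(xlim = c(4000, 5000))
)

# For comparison: penguins inside the range, and penguins with any measurement
n_in_range <- sum(
  penguins$body_mass_g >= 4000 & penguins$body_mass_g <= 5000,
  na.rm = TRUE
)
n_measured <- sum(!is.na(penguins$body_mass_g))

c(scale = n_scale_limits, coord = n_coord_zoom)
```

With scale limits only the in-range penguins survive, so any stat (such as the regression in geom_smooth()) sees a subset of the data; with coord_cartesian() every measured penguin is still there.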

Resources

References

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.
Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Statistics and Computing. New York: Springer-Verlag. https://doi.org/10.1007/0-387-28695-0.