Understanding the foundations of ‘ggplot2’

Jessica Guo

CCT Data Science ggplot2 series

In Part 1 of this series, we will:

  • Explore the grammar of graphics

  • Map data to aesthetics

  • Understand layer components

  • Interpret ggplot2 documentation

  • Create a layered plot

  • Introduce function and syntax of visual elements

The grammar of graphics

What is a grammar?

“The fundamental principles or rules of an art or science” - Oxford English Dictionary

A grammar of graphics:

  • reveals the composition of complicated graphics

  • provides a strong foundation for understanding a range of graphics

  • serves as a guide to what a well-formed or correct graphic looks like

Note

See “The Grammar of Graphics” by Leland Wilkinson (2005) and “A Layered Grammar of Graphics” by Hadley Wickham (2010)

Layered grammar of graphics

Cartoon image of three Chinstrap, Gentoo, and Adelie penguins.

ggplot2 builds complex plots iteratively, one layer at a time.

  • What are the necessary components of a plot?

  • What are the necessary components of a layer?

Components of a plot

A plot contains:

  • Data and aesthetic mapping

  • Layer(s) containing geometric object(s) and statistical transformation(s)

  • Scales

  • Coordinate system

  • (Optional) facets or themes

Components of a layer

A layer contains:

  • Data with aesthetic mapping

  • A statistical transformation, or stat

  • A geometric object, or geom

  • A position adjustment

Mapping data to aesthetics

What data inputs are needed?

Data can be added to either the entire ggplot object or a particular layer.

Input data must be a dataframe in ‘tidy’ format:

  • every column is a variable

  • every row is an observation

  • every cell is a single value

Note

See “Tidy Data” by Wickham (2014) and the associated vignette
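As a rough illustration of reshaping data toward a tidy layout, the sketch below pivots a small wide table into one row per measurement. It assumes the tidyr package is installed; the data frame names `wide` and `long` and the new column names `dimension` and `mm` are made up for this example, and the values are taken from the example penguin table. Whether the wide or the long form counts as tidy depends on what you treat as a variable.

```r
library(tidyr)

# Hypothetical wide table: one column per bill dimension
wide <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 46.7, 46.1),
  bill_depth_mm = c(18.7, 15.3, 18.2)
)

# One row per species-by-dimension measurement
long <- pivot_longer(
  wide,
  cols = c(bill_length_mm, bill_depth_mm),
  names_to = "dimension",  # assumed name for the new key column
  values_to = "mm"         # assumed name for the new value column
)

long
```

In the long form, the bill dimension becomes a variable in its own right, so it could be mapped to an aesthetic such as shape or color.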

Example dataset - raw

species    bill_length_mm  bill_depth_mm  body_mass_g
Adelie     39.1            18.7           3750
Adelie     39.5            17.4           3800
Gentoo     46.7            15.3           5200
Gentoo     43.3            13.4           4400
Chinstrap  46.1            18.2           3250
Chinstrap  51.3            18.2           3750

Example dataset - mapped

aes(x = bill_length_mm,
    y = bill_depth_mm,
    size = body_mass_g,
    color = species)

Variables mapped to aesthetic:

color      x     y     size
Adelie     39.1  18.7  3750
Adelie     39.5  17.4  3800
Gentoo     46.7  15.3  5200
Gentoo     43.3  13.4  4400
Chinstrap  46.1  18.2  3250
Chinstrap  51.3  18.2  3750
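As a minimal sketch of this mapping, the six example rows can be typed in by hand and plotted. The data frame name `penguins_small` is made up for illustration; the column names and values come from the example table.

```r
library(ggplot2)

# Hypothetical six-row data frame copied from the example table
penguins_small <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"),
  bill_length_mm = c(39.1, 39.5, 46.7, 43.3, 46.1, 51.3),
  bill_depth_mm = c(18.7, 17.4, 15.3, 13.4, 18.2, 18.2),
  body_mass_g = c(3750, 3800, 5200, 4400, 3250, 3750)
)

# Each aesthetic is paired with one column of the data
p <- ggplot(penguins_small,
            aes(x = bill_length_mm,
                y = bill_depth_mm,
                size = body_mass_g,
                color = species)) +
  geom_point()

p
```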

Where to specify aesthetics?

  • Can be supplied in the initial ggplot() call, in individual layers, or in a combination of both

  • Data and aesthetics set in ggplot() are inherited by every layer, but a layer can override them

library(ggplot2)
library(palmerpenguins)  # provides the penguins data

# All three calls draw the same plot
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, 
                     color = species)) +
  geom_point()

ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(aes(color = species))

ggplot() +
  geom_point(data = penguins,
             aes(x = body_mass_g, y = flipper_length_mm, color = species))

Inheritance of aesthetics by layers

ggplot(penguins, aes(x = body_mass_g, 
                     y = flipper_length_mm, 
                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) 

ggplot(penguins, aes(x = body_mass_g, 
                     y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", 
              se = FALSE) 
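One way to see what each layer inherits is to inspect the data that geom_smooth() actually draws. The sketch below assumes the palmerpenguins package supplies `penguins`; the object names `p_global` and `p_local` and the helper `n_lines()` are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# color set in ggplot(): inherited by both layers
p_global <- ggplot(penguins, aes(x = body_mass_g,
                                 y = flipper_length_mm,
                                 color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color set only in geom_point(): not seen by geom_smooth()
p_local <- ggplot(penguins, aes(x = body_mass_g,
                                y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", se = FALSE)

# layer_data() returns the data a layer draws; layer 2 is geom_smooth()
n_lines <- function(p) length(unique(layer_data(p, 2)$group))

n_lines(p_global)  # one fitted line per species
n_lines(p_local)   # a single fitted line for all penguins
```

Because grouping follows the discrete aesthetics a layer can see, moving color out of ggplot() changes how many lines are fitted, not just how they are colored.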

Mapping aesthetics to constants

Specifying a quoted constant inside aes() maps the aesthetic to that label and creates a legend on the fly

ggplot(penguins, 
       aes(x = body_mass_g,
           color = species)) +
  geom_point(aes(y = bill_length_mm, 
                 shape = "Length")) +
  geom_point(aes(y = bill_depth_mm, 
                 shape = "Depth")) +
  ylab("Bill dimensions (mm)") +
  labs(shape = "dimension")
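A related point is the difference between mapping a constant inside aes() and setting it outside aes(). The sketch below is a small illustration, assuming palmerpenguins supplies `penguins`; the object names `p_mapped` and `p_set` are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm))

# Inside aes(): "blue" is treated as data, a variable with a single level,
# so the color scale picks the color and a legend appears
p_mapped <- base + geom_point(aes(color = "blue"))

# Outside aes(): "blue" is a literal color, no scale or legend is involved
p_set <- base + geom_point(color = "blue")

unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```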

Customizing layers

Under the hood with layer()

A layer contains:

  • Data with aesthetic mapping

  • A statistical transformation, or stat

  • A geometric object, or geom

  • A position adjustment

ggplot() +
  geom_point()

ggplot() +
  layer(mapping = NULL,
        data = NULL,
        geom = "point",
        stat = "identity",
        position = "identity")

Note

All geom_*() or stat_*() calls are customized shortcuts for the layer() function.

The expediency of defaults

  • Defining each of the components of a layer or whole graphic can be tiresome

  • ggplot2 has a hierarchy of defaults

  • So you can make a graph in 2 lines of code!

The short way and the long way

ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = body_mass_g,
                           y = flipper_length_mm))

ggplot() +
  layer(data = penguins,
        mapping = aes(
          x = body_mass_g,
          y = flipper_length_mm),
        geom = "point", 
        stat = "identity",
        position = "identity") +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()

stat_* vs. geom_*

“Every geom has a default statistic, and every statistic has a default geom.” - Wickham (2010)

  • stat_* transforms the data
    • By computing or summarizing from original input dataset
    • Returns a new dataset that can be mapped to aesthetics
  • geom_* controls the type of plot rendered

Tip

When in doubt, check the documentation

Two ways to plot counts (categorical)

stat_count() and geom_bar() are equivalent

ggplot(data = penguins, 
       mapping = aes(x = species, 
                     fill = sex)) +
  stat_count()

ggplot(data = penguins, 
       mapping = aes(x = species, 
                     fill = sex)) +
  geom_bar()

Two ways to plot density (continuous)

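stat_density() and geom_density() are not equivalent: by default, stat_density() draws stacked areas (geom = "area", position = "stack"), while geom_density() draws overlapping curves (position = "identity")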

ggplot(data = penguins, 
       mapping = aes(x = body_mass_g, 
                     fill = species)) +
  stat_density(alpha = 0.5)

ggplot(data = penguins, 
       mapping = aes(x = body_mass_g, 
                     fill = species)) +
  geom_density(alpha = 0.5)
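To check which defaults a shortcut fills in, one option is to build the layer and look at its components. The helper `layer_defaults()` below is a made-up convenience function, not part of ggplot2; it only reads the class names of the geom, stat, and position stored on a layer object.

```r
library(ggplot2)

# Hypothetical helper: report the geom, stat, and position a layer will use
layer_defaults <- function(layer) {
  c(geom = class(layer$geom)[1],
    stat = class(layer$stat)[1],
    position = class(layer$position)[1])
}

layer_defaults(stat_count())
layer_defaults(geom_bar())

layer_defaults(stat_density())
layer_defaults(geom_density())
```

The count pair should report the same three components, while the density pair should differ in geom and position, which is why the two density plots look different.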

When to use which?

In general, use geom_*() unless you are trying to:

  • Override the default stat

library(dplyr)  # provides count() and %>%

penguins %>%
  count(species) %>%
  ggplot(aes(x = species, y = n)) +
  geom_bar(stat = "identity")

  • Use a variable computed by the stat

ggplot(penguins, aes(x = species, 
                     y = after_stat(prop),
                     group = 1)) +
  geom_bar()

  • Emphasize the statistical transformation

ggplot(penguins) +
  stat_summary(aes(x = species,
                   y = body_mass_g),
               fun.min = min,
               fun.max = max,
               fun = mean)

A panoply of layer options!

Track all geom and stat options

Exercise

For each of the following problems, suggest a useful geom:

  1. Display how a variable has changed over time
  2. Show the detailed distribution of a single variable
  3. Focus attention on one portion of a large dataset
  4. Draw a map
  5. Label outlying points

Position adjustment options

ggplot(data = penguins, mapping = aes(x = species, fill = sex)) +
  geom_bar(position = "stack")

ggplot(data = penguins, mapping = aes(x = species, fill = sex)) +
  geom_bar(position = "fill")

ggplot(data = penguins, mapping = aes(x = species, 
                     fill = sex)) +
  geom_bar(position = "dodge")
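The effect of a position adjustment can also be read off the computed bar heights. This is a sketch assuming palmerpenguins supplies `penguins`; the object names `base`, `stacked`, and `filled` are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = species, fill = sex))

# layer_data() returns the rectangles each bar layer will draw
stacked <- layer_data(base + geom_bar(position = "stack"))
filled  <- layer_data(base + geom_bar(position = "fill"))

# "stack" piles counts on top of each other, so the tallest bar
# reaches the total count of the most common species
max(stacked$ymax)

# "fill" rescales every stack to the same height, so bars show proportions
max(filled$ymax)
```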

Position adjustment options

ggplot(data = penguins, mapping = aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(position = "identity")

ggplot(data = penguins, mapping = aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(position = "jitter")

ggplot(data = penguins, mapping = aes(x = species, y = body_mass_g, color = sex)) +
  geom_point(position = position_jitterdodge())

Position adjustment limitations

Not every position adjustment suits every geom. For example, boxplots and errorbars can’t be stacked.

Exercise

  • What properties must a geom possess to be stackable?

  • What properties must a geom possess to be dodgeable?

Code-along exercise

Recreating a layered plot

Exercise

What are the two layers in this plot? What data went into each?

Adjusting visual elements

Scales and guides

  • Each scale is a function that translates data space (in data units) into aesthetic space (e.g., pixels)

  • A guide (axis or legend) is the inverse function, which converts visual properties back to data

Labeled ggplot figure indicating similarity between axes and legends

Are axes and legends equivalent?

Scale specification

Every aesthetic in a plot is associated with exactly one scale.

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species))

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  scale_x_continuous() + 
  scale_y_continuous() + 
  scale_colour_discrete()

Scale function names are made of 3 pieces separated by “_”:

  1. scale

  2. the name of the primary aesthetic (color, shape, x)

  3. the name of the scale (discrete, continuous, brewer)
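The naming pattern can be tried out by swapping default scales for customized ones. The sketch below assumes palmerpenguins supplies `penguins`; the axis title, breaks, and the "Dark2" palette are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins,
            aes(x = body_mass_g,
                y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  # scale + x + continuous: adjust the x axis title and breaks
  scale_x_continuous(name = "Body mass (g)",
                     breaks = seq(3000, 6000, by = 1000)) +
  # scale + colour + brewer: use a ColorBrewer palette for species
  scale_colour_brewer(palette = "Dark2")

p
```

The y aesthetic is not mentioned, so it keeps its default continuous scale.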

What does a coordinate system do?

Coordinate systems have 2 primary roles:

  1. Combine the x and y position aesthetics to produce a 2-dimensional position on the plot

  2. In coordination with faceting (optional), draw axes and panel backgrounds

Types of coordinate systems

Linear:

  • coord_cartesian(): common default

  • coord_flip(): x and y axes flipped

  • coord_fixed(): fixed aspect ratio

Non-linear:

  • coord_map()/coord_quickmap()/coord_sf(): map projections, x and y become longitude and latitude

  • coord_polar(): polar coordinates, x and y become angle and radius

  • coord_trans(): apply transformations
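As a quick sketch of how the coordinate system changes a plot without touching the data or the layer, the same bar chart can be drawn in three coordinate systems. This assumes palmerpenguins supplies `penguins`; the object names are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# One layer, drawn in the default Cartesian coordinates
bars <- ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar()

# Same layer with x and y swapped: horizontal bars
bars_flipped <- bars + coord_flip()

# Same layer in polar coordinates: x becomes angle, y becomes radius
bars_polar <- bars + coord_polar()

bars
bars_flipped
bars_polar
```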

Faceting

Creates small multiples to show different subsets:

  • facet_null(): default

  • facet_wrap(): “wraps” a 1d ribbon of panels into 2d

  • facet_grid(): 2d grid of panels defined by row and column
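The two faceting functions can be compared on the same scatterplot. This is a minimal sketch assuming palmerpenguins supplies `penguins`; the choice of species and sex as faceting variables is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

points <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

# One panel per species, wrapped into rows as space allows
wrapped <- points + facet_wrap(vars(species))

# Rows defined by sex, columns defined by species
gridded <- points + facet_grid(rows = vars(sex), cols = vars(species))

wrapped
gridded
```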

Comparison of facet_wrap and facet_grid organization

Keeping points of reference

Exercise

Recreate the figure below. How would you get the gray points to show up on all facets?

Theming

Controls non-data elements of plots (e.g., to match a style guide).

  1. Theme elements specify the non-data elements you can control: plot.title, legend.position

  2. Each element has an element function to describe its visual properties: element_text(), element_blank()

  3. The theme() function allows overriding of the default theme: theme(legend.title = element_blank())
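The three pieces can be combined in a single theme() call. The sketch below assumes palmerpenguins supplies `penguins`; the title text and the specific element settings are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins,
            aes(x = body_mass_g,
                y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  labs(title = "Flipper length by body mass") +
  theme(
    plot.title = element_text(face = "bold"),  # element function sets text properties
    legend.title = element_blank(),            # element_blank() removes the element
    legend.position = "bottom"                 # some elements take a plain value
  )

p
```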

Complete themes

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  theme_bw()

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  theme_minimal()

ggplot(penguins, 
       aes(x = body_mass_g,
           y = flipper_length_mm)) +
  geom_point(aes(color = species)) +
  theme_classic()

Further resources

  • Penguin artwork by @allison_horst

  • Hadley Wickham’s “A layered grammar of graphics” (2010)

  • Hadley Wickham’s “ggplot2: Elegant Graphics for Data Analysis, 3rd edition”, now available online

  • “R for Data Science”, by Hadley Wickham, Mine Çetinkaya-Rundel, & Garrett Grolemund, especially chapters 2, 10, and 12

  • See us at drop-in hours