Data manipulation with dplyr and tidyr

Session 6

September 21, 2023

Review

Informal poll:

Do you collaborate with anyone who could or currently uses GitHub?

Review

Chat:

What would you say to encourage them to start?

Review

Chat:

How would you get set up? What would be your core workflow?

Warm-up

Create a branch for today’s work.

Note

Create this branch from a repo you own, not the demo from Tuesday!

We might need to create a repo for R scripting for this workshop.

Learning objectives

  • Understand the principles of “tidy data”
  • Use tidyverse packages (dplyr and tidyr) to work with data in R.
  • Practice tracking our work in git

What is “tidy data”?

Lowndes and Horst say it so beautifully.

The tidyverse

The “tidyverse”

  • ✨ Common structure and syntax
  • ✨ Emphasize user readability
  • ✨ Friendly documentation
  • ✨ Updates rapidly

The “tidyverse”

  • ‼️ Extra installs over “base R”
  • ‼️ Very different syntax than “base R”
  • ‼️ Updates rapidly

The “tidyverse”

  • 🧭 Only install/load the functions/packages you need
  • 🧭 Be aware that others may be unfamiliar; comment your code well
  • 🧭 Keep track of your package versions
  • 🧭 Software developers, keep an eye on changes!

Tidy data in R

Let’s install some packages:

install.packages("dplyr") # dataframe manipulation
install.packages("tidyr") # "reshaping" data
install.packages("palmerpenguins") # open teaching dataset

Coding time: Data types in R

Key points:

  • Major data types in R include characters, integers, double (numeric), Boolean (T/F), and factors.
  • Common tabular formats are data.frame or tibble
  • You can perform “vectorized” operations on whole vectors (or columns) at once.
  • Missing data are handled as NAs.

Coding time: dplyr

Core dplyr verbs:

  • select pulls columns
  • filter pulls rows based on values
  • mutate adds or modifies a column
  • group_by + summarize calculates group-wise summary statistics
  • *_join functions combine data frames based on matching columns

A note on pipes

  • Chain functions together into “paragraphs”
  • %>%: older, included in dplyr (ultimately depends on magrittr)
  • |>: included in base R as of 4.1.0

Coding time: tidyr

Key tidyr manipulations:

  • pivot_longer turns columns names into row values
  • pivot_wider creates new columns based on the values of a given field.

Note

Data should be as long as is reasonable (but not longer)!

Coding time: finding the right function

  • ?tidyr::pivot_wider and index pages
  • More ways of getting help: see the Getting Help section!

Wrapping up

Work through the steps to synchronize the changes you’ve made today with GitHub.

Wrapping up

Work through the steps to synchronize the changes you’ve made today with GitHub.

Command-line instructions

git add <your script name>

git commit

git push

Homework

  • Find a script you have used in one of your projects and translate it to use these verbs.
    • Bonus points (in the form of 🎉🎊✨) if you look up a new function and apply it!