Data manipulation with dplyr and tidyr

Session 7

September 24, 2024

Review

Informal poll:

Do you collaborate with anyone who could use GitHub, or who already does?

Review

Chat:

What would you say to encourage them to start?

Learning objectives

  • Practice tracking our work in git
  • Understand what makes code “good”
  • Understand the principles of “tidy data”
  • Use tidyverse packages (dplyr and tidyr) to work with data in R
  • Identify how to apply tidy tools to your colloquium project

GitHub practice

Create a branch for today’s work in your workshop repo.

Discussion: What makes code “good”?

Breakout rooms (5 minutes) to discuss, and then report back to the group.

What makes code “good”?

Suggestions:

  • Readability
  • Reproducibility
  • Accuracy
  • Speed/efficiency

Readable code

From the Carpentries lab

  • Use comments to explain decisions
  • Break code into modules (covered next session)
  • Avoid repetition wherever possible (covered next week)
  • Make dependencies clear
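
As a rough illustration of two of these points (clear dependencies and comments that explain decisions), the top of a script might look like the sketch below. It assumes dplyr and palmerpenguins are installed; the object name `penguins_sexed` and the decision it documents are made up for this example.

```r
# Dependencies: attached at the top of the script so readers see them first
library(dplyr)
library(palmerpenguins)

# Decision: drop penguins with no recorded sex, because the
# (hypothetical) comparison that follows is between sexes.
penguins_sexed <- penguins |>
  filter(!is.na(sex))

# Quick check that the filter did what the comment says
anyNA(penguins_sexed$sex)
```

The comment records why rows were dropped, which the code alone cannot tell a reader.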

Reproducible code

From the Carpentries lab

  • Avoid commenting out lines to control which code runs
  • Use a test data set
  • Use version control (like GitHub!)
  • Keep track of your software environment
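
One possible way to combine the first two points is a named switch that chooses between a small test data set and the full data, instead of commenting lines in and out. This is only a sketch: the variable names `use_test_data` and `analysis_data` are hypothetical, and it assumes palmerpenguins is installed.

```r
library(palmerpenguins)

# Hypothetical switch: set to FALSE to run on the full data set
use_test_data <- TRUE

# Test data set: the first 20 rows, small enough to inspect by eye
analysis_data <- if (use_test_data) head(penguins, 20) else penguins

# The same analysis code runs in both cases
mean_mass_g <- mean(analysis_data$body_mass_g, na.rm = TRUE)
print(mean_mass_g)
```

Because the switch is a single visible line, the version control history shows exactly which mode was used for a given run.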

Good enough practices

See the paper: Wilson et al., “Good Enough Practices for Scientific Computing” (full citation under References).

What is “tidy data”?

Lowndes and Horst say it so beautifully.

The tidyverse

The “tidyverse”

  • ✨ Common structure and syntax
  • ✨ Emphasize user readability
  • ✨ Friendly documentation
  • ✨ Updates rapidly

The “tidyverse”

  • ‼️ Extra installs over “base R”
  • ‼️ Very different syntax than “base R”
  • ‼️ Updates rapidly (can be a reproducibility challenge)

The “tidyverse”

  • 🧭 Only install/load the functions/packages you need
  • 🧭 Be aware that others may be unfamiliar; comment your code well
  • 🧭 Keep track of your package versions
  • 🧭 Software developers, keep an eye on changes!
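
A minimal sketch of the first and third suggestions, assuming dplyr and palmerpenguins are installed. The object names (`adelie_only`, `env_record`) are illustrative only.

```r
# Use one function through package::function, without attaching the whole package
adelie_only <- dplyr::filter(palmerpenguins::penguins, species == "Adelie")

# Record the version of a single package
dplyr_version <- packageVersion("dplyr")
print(dplyr_version)

# Record the R version and all loaded packages as text;
# this could be saved alongside the analysis, e.g. with writeLines()
env_record <- capture.output(sessionInfo())
head(env_record)
```

The `package::function` form also makes it obvious to readers unfamiliar with the tidyverse where each function comes from.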

Tidy data in R

Let’s install some packages:

install.packages("dplyr") # dataframe manipulation
install.packages("tidyr") # "reshaping" data
install.packages("palmerpenguins") # open teaching dataset
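
Once the installs finish, the packages still need to be attached in each R session. A short sketch of that step, followed by a first look at the teaching data:

```r
library(dplyr)          # dataframe manipulation
library(tidyr)          # "reshaping" data
library(palmerpenguins) # provides the `penguins` data frame

# One row per penguin; glimpse() lists every column with its type
glimpse(penguins)
dim(penguins)
```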

Coding time: dplyr

Core dplyr verbs:

  • select pulls columns
  • filter pulls rows based on values
  • mutate adds or modifies a column
  • group_by + summarize calculates group-wise summary statistics
  • *_join functions combine data frames based on matching columns
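
The sketch below tries each verb once on the penguins data. It assumes dplyr and palmerpenguins are installed. The 5000 g threshold and the `island_notes` lookup table (including its `visited` column) are invented for illustration.

```r
library(dplyr)
library(palmerpenguins)

# select: keep only the columns we need
penguins_small <- penguins |>
  select(species, island, body_mass_g)

# filter: keep rows that meet a condition (rows with missing mass are dropped)
heavy <- penguins_small |>
  filter(body_mass_g > 5000)

# mutate: add a column computed from an existing one
penguins_kg <- penguins_small |>
  mutate(body_mass_kg = body_mass_g / 1000)

# group_by + summarize: one row of summary statistics per species
mass_by_species <- penguins_small |>
  group_by(species) |>
  summarize(
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),
    n_penguins = n()
  )

# left_join: attach columns from a second (hypothetical) table by matching island
island_notes <- data.frame(
  island = c("Biscoe", "Dream", "Torgersen"),
  visited = c(TRUE, FALSE, TRUE)
)
penguins_noted <- penguins_small |>
  left_join(island_notes, by = "island")

mass_by_species
```

Each step stores its result under a new name so the intermediate tables can be inspected one at a time.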

A note on pipes

  • Chain functions together into “paragraphs”
  • %>%: older, provided by the magrittr package and re-exported by dplyr
  • |>: included in base R as of 4.1.0
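
To make the "paragraph" idea concrete, here is the same calculation written three ways: nested calls, the magrittr pipe, and the native pipe. This is a sketch that assumes dplyr and palmerpenguins are installed, and R 4.1.0 or later for `|>`.

```r
library(dplyr)
library(palmerpenguins)

# Nested calls: must be read from the inside out
nested <- summarize(group_by(filter(penguins, !is.na(sex)), sex), n = n())

# magrittr pipe: reads top to bottom
piped_magrittr <- penguins %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarize(n = n())

# Native pipe: same steps, no extra package needed for the pipe itself
piped_native <- penguins |>
  filter(!is.na(sex)) |>
  group_by(sex) |>
  summarize(n = n())

# All three should give the same table
all.equal(nested, piped_native)
```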

Coding time: tidyr

Key tidyr manipulations:

  • pivot_longer turns column names into row values
  • pivot_wider creates new columns from the values of a given column
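
A short sketch of both manipulations on the penguins data, assuming dplyr, tidyr and palmerpenguins are installed. The new column names (`measurement`, `value_mm`) are arbitrary choices for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# pivot_longer: the two bill columns become rows,
# with the old column name stored in `measurement`
bill_long <- penguins |>
  select(species, bill_length_mm, bill_depth_mm) |>
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm),
    names_to = "measurement",
    values_to = "value_mm"
  )

# pivot_wider: one new column per island, filled with penguin counts
counts_wide <- penguins |>
  count(species, island) |>
  pivot_wider(
    names_from = island,
    values_from = n,
    values_fill = 0
  )

head(bill_long)
counts_wide
```

The long table has twice as many rows as the original because each penguin now contributes one row per measurement.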

Note

Data should be as long as is reasonable (but not longer)!

Coding time: finding the right function

  • ?tidyr::pivot_wider and index pages
  • More ways of getting help: see the Getting Help section!
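
A few help commands that may be useful when looking for the right function, shown here for tidyr. These are meant to be run interactively, and they assume tidyr is installed.

```r
# Help page for one function
?tidyr::pivot_wider

# Index page listing everything the package documents
help(package = "tidyr")

# Long-form tutorial on pivoting that ships with tidyr
vignette("pivot", package = "tidyr")

# Search help pages in tidyr for a keyword
help.search("pivot", package = "tidyr")

# Quick reminder of a function's arguments
args(tidyr::pivot_wider)
```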

More git practice

Work through the steps to synchronize the changes you’ve made today with GitHub.

Command-line instructions

git add <your script name>

git commit

git push

Resources

Your colloquium project

  • Between now and Thursday, revisit the code in your colloquium project.
  • Identify any sections of the code that could be improved based on the conversations and tools we covered today.
  • Make those changes, if you have the skills. Otherwise, take notes and we will talk more on Thursday.

References

Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., and Teal, T.K. (2016). Good Enough Practices for Scientific Computing. http://github.com/swcarpentry/good-enough-practices-in-scientific-computing/
Wallace, E.W.J., Meynert, A., Zielinski, T., Romanowski, A., et al. (2022). Good Enough Practices in Scientific Computing: A Lesson (Version 0.1.0). https://doi.org/tbc; also https://github.com/carpentries-lab/good-enough-practices/