Data manipulation

Objectives

  • Use git branches in the “real world”
  • Reproducibly clean, summarize, and organize dataframes using tidyverse packages
  • Understand how R stores different data types (data frames, vector types, missing data)
  • Know what the tidyverse is, how it differs from base R, and the philosophy behind using it here
  • Use the pipe to chain operations together
  • Use dplyr functions to subset data (select, filter; logical operators and selection helpers, including where, ==, %in%, and !) and manipulate data (mutate; dates with lubridate; split-apply-combine)
  • Use tidyr functions to reshape data (pivot_wider and pivot_longer)
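
As a rough preview of how several of these objectives fit together, here is a minimal sketch. The `surveys` data frame and its column names are invented for illustration and are not taken from the lesson data. It uses the native pipe `|>`, which assumes R 4.1 or later; the magrittr pipe `%>%` would work the same way here.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, made up for this sketch only
surveys <- data.frame(
  plot_id  = c(1, 1, 2, 2),
  species  = c("DM", "PP", "DM", "PP"),
  weight_g = c(40, 15, NA, 17)
)

# Subset rows and columns, then add a derived column
clean <- surveys |>
  filter(!is.na(weight_g), species %in% c("DM", "PP")) |>
  select(species, where(is.numeric)) |>
  mutate(weight_kg = weight_g / 1000)

# Split-apply-combine: one summary row per species
mean_weights <- clean |>
  group_by(species) |>
  summarize(mean_weight_g = mean(weight_g), .groups = "drop")

# Reshape to one column per species, then back to long format
wide <- clean |>
  select(plot_id, species, weight_g) |>
  pivot_wider(names_from = species, values_from = weight_g)

long <- wide |>
  pivot_longer(cols = c(DM, PP), names_to = "species", values_to = "weight_g")
```

Note that `long` has one more row than `clean`: the plot/species combination that was filtered out reappears as an explicit `NA` after the round trip through the wide format.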

Lesson outline

  • Review from last week
  • Warm-up: create a branch for today’s work
  • Slides/discussion: using R for reproducible data analysis
    • Why use R?
    • What is the tidyverse and why use it?
    • Install dplyr and tidyr
  • Live coding: How R thinks about data
  • Live coding: dplyr and tidyr
  • Live coding: advanced tidyverse topics
    • Options: across; dates; advanced joins; others?
  • Live coding/discussion: getting help
    • reprex
  • Live coding: practice modify-add-commit cycle
  • Homework: None

Installation & materials

  1. Slides
  2. Install R packages ‘dplyr’, ‘tidyr’, ‘readr’
  3. Data Carpentry R ecology revamp episode #2
  4. Data Carpentry R ecology revamp episode #3

Citation

BibTeX citation:
@online{scott2024,
  author = {Scott, Eric and Diaz, Renata and Guo, Jessica and Riemer,
    Kristina},
  title = {Data Manipulation},
  date = {2024},
  url = {https://cct-datascience.github.io/repro-data-sci/lessons/7-data-manipulation/notes.html},
  doi = {10.5281/zenodo.8411612},
  langid = {en}
}
For attribution, please cite this work as:
Scott, Eric, Renata Diaz, Jessica Guo, and Kristina Riemer. 2024. “Data Manipulation.” Reproducibility & Data Science in R. 2024. https://doi.org/10.5281/zenodo.8411612.