Data manipulation

Objectives

Use git branches in the “real world”
Reproducibly clean, summarize, and organize data frames using tidyverse packages
Understand how R stores different data types (data frames, vector types, missing data)
Know what the tidyverse is, how it differs from base R, and the philosophy behind using it here
Use the pipe to chain operations together
Use dplyr functions to subset data (select, filter; logical operators such as ==, %in%, and !; selection helpers such as where) and manipulate data (mutate; dates with lubridate; split-apply-combine)
Use tidyr functions to reshape data (pivot_wider and pivot_longer)
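
As a rough illustration of how the pipe, dplyr, and tidyr objectives above fit together, here is a minimal sketch. The `surveys` data frame and its columns are made up for this example and are not the lesson's dataset; the native pipe `|>` assumes R 4.1 or later (`%>%` would work the same way here).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, invented for illustration only
surveys <- tibble(
  species = c("DM", "DM", "PF", "PF", "DO"),
  year    = c(2000, 2001, 2000, 2001, 2000),
  weight  = c(40, NA, 8, 7, 50)   # grams; NA marks a missing measurement
)

# Subset rows, add a column, then split-apply-combine
mean_weights <- surveys |>
  filter(!is.na(weight), species %in% c("DM", "PF")) |>
  mutate(weight_kg = weight / 1000) |>
  group_by(species, year) |>
  summarize(mean_weight_kg = mean(weight_kg), .groups = "drop")

# Reshape: one column per year ...
wide <- mean_weights |>
  pivot_wider(names_from = year, values_from = mean_weight_kg)

# ... and back to one row per species-year combination
long <- wide |>
  pivot_longer(cols = -species, names_to = "year", values_to = "mean_weight_kg")
```

Note that `pivot_longer()` returns the former column names as character values, so `year` would need converting back to a number if later steps depend on it.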

Lesson outline

Review from last week
Warm-up: create a branch for today’s work
Slides/discussion: using R for reproducible data analysis
- Why use R?
- What is the tidyverse and why use it?
- Install dplyr and tidyr
Live coding: How R thinks about data
- Data Carpentry R ecology revamp episode #2
- Data frames
- Vectors and data types
- Missing data
Live coding: dplyr and tidyr
- Data Carpentry R ecology revamp episode #3
- Recent DC + R lesson
- Chaining lines together with the pipe
  - %>%, |>
- Subsetting and filtering data
  - including tidy-select helpers and pick(); see https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
- Adding columns
- Split-apply-combine
- Reshaping
Live coding: advanced tidyverse topics
- Options: across; dates; advanced joins; others?
Live coding/discussion: getting help
- reprex
Live coding: practice modify-add-commit cycle
Homework: None
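
The outline items on vectors, missing data, and tidy selection can be previewed with a short sketch like the one below. The object names (`weights`, `mixed`, `df`) are hypothetical and chosen only for this example; the lambda shorthand `\(v)` assumes R 4.1 or later.

```r
library(dplyr)

# A vector holds a single data type; NA marks a missing value
weights <- c(40, NA, 8)
class(weights)                      # "numeric"
mean(weights)                       # NA, because one value is missing
mean(weights, na.rm = TRUE)         # 24
sum(is.na(weights))                 # 1

# Mixing types forces coercion to the most flexible type
mixed <- c(1, "a", TRUE)
class(mixed)                        # "character"

# A data frame is a collection of equal-length vectors (columns)
df <- tibble(id = c("a", "b"), x = c(1, NA), y = c(3, 4))

# where() selects columns by a property of their contents
numeric_cols <- df |> select(where(is.numeric))

# across() applies the same function to every selected column
col_means <- df |>
  summarize(across(where(is.numeric), \(v) mean(v, na.rm = TRUE)))
```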

Installation & materials

Slides
Install R packages ‘dplyr’, ‘tidyr’, ‘readr’
Data Carpentry R ecology revamp episode #2
Data Carpentry R ecology revamp episode #3
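
One possible way to install the packages listed above is sketched here. It installs only what is missing; the CRAN mirror URL is an assumption and any mirror should work.

```r
# Packages named in the materials list above
pkgs <- c("dplyr", "tidyr", "readr")

# Find any that are not yet installed
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only the missing ones (mirror URL is an assumed default)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```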

Citation

BibTeX citation:

@online{scott2024,
  author = {Scott, Eric and Diaz, Renata and Guo, Jessica and Riemer,
    Kristina},
  title = {Data Manipulation},
  date = {2024},
  url = {https://cct-datascience.github.io/repro-data-sci/lessons/7-data-manipulation/notes.html},
  doi = {10.5281/zenodo.8411612},
  langid = {en}
}

For attribution, please cite this work as:

Scott, Eric, Renata Diaz, Jessica Guo, and Kristina Riemer. 2024. “Data Manipulation.” Reproducibility & Data Science in R. 2024. https://doi.org/10.5281/zenodo.8411612.