Reproducibility & Data Science in R

Session 1

September 5, 2023

What is Reproducibility & Why?

By The Turing Way Community

The Whole Picture

A (usually fictional) story:

You read a great paper and think “I bet I could apply their analysis methods to my work!” You click a DOI link in their Data Availability section (which they definitely have). It opens a web page where you can download a folder with R code, data, and documentation about the code and data. The page also has detailed information about how to cite the code and data. You open the folder in RStudio and you are prompted to install all the packages you need to run the code. You open the analysis script and hit “run”. All the code runs perfectly with no errors, creating all the figures, tables, and statistics used in the paper. You scroll through the well-formatted R code and understand from the authors’ comments exactly what the code does and how to adapt it to your work.

We want to help you make this story a reality for someone else!

Syllabus & Workshop Materials

Workshop series website:
https://cct-datascience.github.io/repro-data-sci/

Screen Setup

  • Dual monitors will be very helpful
  • Virtual desktops (“Spaces” in macOS) also helpful
  • Let us know if you do not have access to a second monitor
Photo of a desk with a laptop on a laptop stand to the left and a larger monitor to the right in front of an external keyboard.  Laptop monitor has Zoom open and the larger monitor has RStudio open.

Creating a Research Compendium with R

Learning Objectives:

  1. Use RStudio projects to create self-contained reproducible projects
  2. Use best practices for organizing files in a project
  3. Use relative file paths to improve portability of projects
  4. Structure R scripts so they are easier to understand

Settings for Success

image of RStudio global settings pane with the "Workspace" and "History" sections highlighted.  None of the boxes are checked and the dropdown for "Save workspace to .RData on exit" is set to "never"

  • Starting each session fresh means results cannot depend on leftover objects

  • Use Session > Restart R to check reproducibility

  • If long-running code is a concern, there are better solutions than saving the workspace (e.g. saving specific results to a file)
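
One possible alternative to saving the whole workspace is to cache just the slow result in a file. The sketch below is only an illustration: `slow_computation()` and the `slow_result.rds` file name are hypothetical, and a temporary directory is used so the example is self-contained. In a real project the cache file would more likely live in an outputs folder.

```r
# Hypothetical cache location; tempdir() keeps this sketch self-contained.
cache_file <- file.path(tempdir(), "slow_result.rds")

# Stand-in for a long-running computation (hypothetical).
slow_computation <- function() {
  Sys.sleep(1)
  sum(seq_len(1000))
}

if (file.exists(cache_file)) {
  # Reuse the saved result instead of recomputing it.
  result <- readRDS(cache_file)
} else {
  # First run: compute the result and save it for next time.
  result <- slow_computation()
  saveRDS(result, cache_file)
}

result
```

Deleting the cache file forces the computation to run again, so the result is still fully reproducible from code after Session > Restart R.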

Project Management

Research Compendium Best Practices

  • Treat data as read-only

  • Use scripts to “clean” and wrangle data

  • Treat generated outputs as disposable

  • Put data, code, and outputs in different folders
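
The practices above might look something like this in code. This is a minimal sketch, not a required layout: the folder names (`data_raw`, `R`, `output`), the project name, and the example data are all made up for illustration, and everything is created in a temporary directory so nothing in your own files is touched.

```r
# Build the example in a temporary directory (hypothetical project name).
project <- file.path(tempdir(), "my-compendium")

# Separate folders for raw data, code, and generated outputs.
for (folder in c("data_raw", "R", "output")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# Pretend this file came from an instrument: written once, never edited by hand.
raw_path <- file.path(project, "data_raw", "measurements.csv")
write.csv(
  data.frame(id = 1:4, height_cm = c(12.1, NA, 15.3, 14.8)),
  raw_path,
  row.names = FALSE
)

# Cleaning happens in a script, and the result goes to a different folder.
raw <- read.csv(raw_path)
clean <- raw[!is.na(raw$height_cm), ]
clean_path <- file.path(project, "output", "measurements_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)

# Show the resulting project structure.
list.files(project, recursive = TRUE)
```

Because the cleaned file is generated entirely by code, it can be deleted and recreated at any time, while the raw file stays exactly as it was received.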

RStudio projects (live coding)

R code best practices (live coding)

Takeaways

  • Structure files in self-contained projects or “research compendia”
  • Put data in a separate folder and never edit raw data!
  • Avoid setwd() and getwd(); use relative paths and RStudio projects instead
  • Naming things well is difficult but worth spending time on
  • Use a consistent style in your code and organize scripts into sections
  • Split long scripts into multiple files and use source() to run them if needed
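
As a rough illustration of the last few takeaways, the sketch below splits an analysis into two small scripts, runs them in order with `source()`, and uses section comments (a comment ending in four or more dashes, which RStudio shows in the script outline). The script names, folder name, and data are hypothetical. A temporary folder stands in for the project root here; inside a real RStudio project you would simply write relative paths such as `source("R/01_load.R")`.

```r
# Setup ----------------------------------------------------------------------
# A temporary folder stands in for the project root (hypothetical layout).
project <- file.path(tempdir(), "source-demo")
dir.create(file.path(project, "R"), recursive = TRUE, showWarnings = FALSE)

# Write two small scripts ----------------------------------------------------
# In a real project these files would already exist in the R/ folder.
writeLines("heights <- c(12.1, 15.3, 14.8)", file.path(project, "R", "01_load.R"))
writeLines("mean_height <- mean(heights)", file.path(project, "R", "02_analyze.R"))

# Run the scripts in order ---------------------------------------------------
# file.path() builds paths from pieces, so no setwd() is needed.
scripts <- file.path(project, "R", c("01_load.R", "02_analyze.R"))
for (script in scripts) {
  source(script)
}

mean_height
```

Numbering the file names (01_, 02_) is one simple way to make the intended run order obvious to someone opening the project for the first time.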

Homework (optional)

Re-organize an existing project into a research compendium

OR

Apply a consistent coding style to one of your R scripts (e.g. with Code > Reformat Code or with the styler package)
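
If you want to preview what the styler package would do before applying it to a file, something like the following may help. It is a small sketch that assumes styler is installed (it falls back to a message if not); the messy example code is made up for illustration.

```r
# Two lines of deliberately inconsistent code, stored as text (made-up example).
messy <- c(
  "x=c(1,2,3)",
  "if(length(x)>2){print( mean(x) )}"
)

if (requireNamespace("styler", quietly = TRUE)) {
  # style_text() returns the restyled code without touching any file.
  print(styler::style_text(messy))
} else {
  message("Install styler with install.packages(\"styler\") to try this.")
}
```

To restyle a script on disk, styler also provides `style_file()`, which overwrites the file in place, so it is worth committing or backing up the script first.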