Harvard Forest REU Symposium Abstract 2022

  • Title: Support for Collecting Data Provenance from RMarkdown Documents
  • Author: Sean N Fabrega (Mount Holyoke College)
  • Abstract:

    Multiple STEM fields are experiencing a "reproducibility crisis": many researchers cannot reproduce the results of previous work. Sometimes this is because the original software or data is not available. Data provenance, the information about how processing alters or uses data during exploration, addresses this issue by recording details about each processing step and the computing environment at execution time. Tools such as rdtLite, an R package developed at Harvard Forest and Mount Holyoke College, support provenance collection for processes that use R. Two main types of files are used for data exploration in R: R scripts and RMarkdown documents. With RMarkdown, a user can embed code segments in a text document. These segments, or "chunks", can be run individually or all at once, which allows code to be organized into specific tasks and annotated more clearly. Prior to my project, rdtLite processed an RMarkdown document by converting it into an R script and then collecting provenance on the generated script. As a result, the benefits of RMarkdown, such as its "chunk-style" organization, were not reflected in the collected provenance. In this project, a new method for provenance collection was added to rdtLite. Users can now request detailed provenance collection on specific chunks by adding "details = TRUE" to the chunk headers. In addition, provViz, rdtLite's visualization tool for displaying data exploration, captures chunk information to make the provenance more understandable. Because users choose which chunks receive detailed provenance collection, this method decreases run time, and it records chunk information in the provenance.

  • Research Category: Ecological Informatics and Modelling; Group Projects
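
To illustrate the chunk-level request described in the abstract, an RMarkdown document might look like the sketch below. Only the "details = TRUE" option comes from the abstract. The chunk labels, the file name, and the chunk contents are hypothetical, and the exact header form that rdtLite accepts is an assumption here rather than something the abstract confirms.

````markdown
The first chunk asks for detailed provenance; the second does not.

```{r load-data, details = TRUE}
# Hypothetical chunk: "details = TRUE" in the header requests
# detailed provenance collection for this chunk only.
measurements <- read.csv("measurements.csv")  # hypothetical file name
```

```{r summarize}
# Hypothetical chunk without the option, so under the method described
# in the abstract it would not receive detailed provenance collection.
summary(measurements)
```
````

Under this reading, a user would mark only the chunks whose internal steps matter for reproducing a result, which is how the abstract's reduction in run time would come about.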