Harvard Forest Symposium Abstract 2019

  • Title: Collecting and Using Provenance in R
  • Primary Author: Emery Boose (Harvard Forest)
  • Additional Authors: Aaron Ellison (Harvard University); Elizabeth Fong (Mount Holyoke College); Matthew Lau (Harvard Forest); Barbara Lerner (Mount Holyoke College); Thomas Pasquier (Harvard University); Margo Seltzer (University of British Columbia)
  • Abstract:

    The software tools that scientists use to process and analyze data are typically optimized for performance and ease of use. Few, if any, are designed to record the details of what happens as the tool performs its task. This detailed information, and more generally the history of an item of data from its creation to its present state, is known as provenance. We believe that provenance has great potential to make science more transparent, reliable, and reproducible.

    In our work to date we have focused on collecting and using provenance for scripts written in the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. Our current tools include:

    1. rdtLite collects provenance as an R script executes (or during a console session) and saves it to a file in an extended PROV-JSON format. Simple data values are saved automatically. Complex data values (e.g., data frames) may optionally be saved, wholly or in part, as separate snapshot files. Available on CRAN.

    2. provSummarizeR creates a concise high-level summary of the provenance collected by rdtLite, including information about the computing environment, loaded libraries, sourced scripts, and inputs & outputs. Available on CRAN.

    3. provViz provides an R interface for a visualization tool (written in Java) that allows the user to view and query the provenance graph directly. Available on CRAN.

    4. provClean uses the provenance collected by rdtLite to create a simplified version of the original R script that contains only those statements needed to produce a specified result. Under development.

    5. provDebugR uses the provenance collected by rdtLite to support time-traveling debugging of an R script without the need to set breakpoints or insert print statements and rerun the script. Under development.
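
    The idea behind item 4, keeping only the statements needed to produce a specified result, can be illustrated with a backward walk over a provenance graph. The following is a minimal Python sketch under stated assumptions, not the provClean implementation: the record is hand-written in the general shape of W3C PROV-JSON (`activity`, `entity`, `wasGeneratedBy`, `used`), and the ids (`p1`, `d1`, ...), the `label` field, and the function name `statements_needed_for` are hypothetical. The actual rdtLite output uses its own extended schema.

```python
from collections import deque

# Hypothetical provenance for a four-line R script, written by hand in the
# general shape of W3C PROV-JSON. This is NOT actual rdtLite output.
prov = {
    "activity": {  # one node per executed statement, in execution order
        "p1": {"label": "raw <- read.csv('data.csv')"},
        "p2": {"label": "clean <- na.omit(raw)"},
        "p3": {"label": "n <- nrow(raw)"},
        "p4": {"label": "m <- mean(clean$temp)"},
    },
    "entity": {  # one node per data value
        "d1": {"label": "raw"},
        "d2": {"label": "clean"},
        "d3": {"label": "n"},
        "d4": {"label": "m"},
    },
    "wasGeneratedBy": {  # statement -> value it produced
        "g1": {"prov:entity": "d1", "prov:activity": "p1"},
        "g2": {"prov:entity": "d2", "prov:activity": "p2"},
        "g3": {"prov:entity": "d3", "prov:activity": "p3"},
        "g4": {"prov:entity": "d4", "prov:activity": "p4"},
    },
    "used": {  # value -> statement that consumed it
        "u1": {"prov:activity": "p2", "prov:entity": "d1"},
        "u2": {"prov:activity": "p3", "prov:entity": "d1"},
        "u3": {"prov:activity": "p4", "prov:entity": "d2"},
    },
}


def statements_needed_for(prov, target_entity):
    """Return the statements that the target value depends on, in order."""
    # Index the edges: which statement produced each value, and which
    # values each statement consumed.
    generated_by = {
        edge["prov:entity"]: edge["prov:activity"]
        for edge in prov["wasGeneratedBy"].values()
    }
    inputs_of = {}
    for edge in prov["used"].values():
        inputs_of.setdefault(edge["prov:activity"], []).append(edge["prov:entity"])

    # Walk backward from the target value through its producers.
    needed, seen = set(), set()
    queue = deque([target_entity])
    while queue:
        entity = queue.popleft()
        if entity in seen:
            continue
        seen.add(entity)
        activity = generated_by.get(entity)
        if activity is None:
            continue  # value came from outside the script
        needed.add(activity)
        queue.extend(inputs_of.get(activity, []))

    # Report the needed statements in their original execution order.
    return [
        node["label"] for act_id, node in prov["activity"].items() if act_id in needed
    ]


for line in statements_needed_for(prov, "d4"):
    print(line)
```

    Asking for the value `m` (`d4`) keeps the statements that read, clean, and average the data and drops `n <- nrow(raw)`, which does not contribute to that result.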

    Project software and additional details are available on GitHub (http://end-to-end-provenance.github.io/). For additional related publications, see dataset HF091.

  • Research Category: Ecological Informatics and Modelling