Harvard Forest Symposium Abstract 2017

  • Title: Collecting and Visualizing Data Provenance in R
  • Primary Author: Emery Boose (Harvard Forest)
  • Additional Authors: Aaron Ellison (Harvard Forest); Elizabeth Fong (Mount Holyoke College); Matthew Lau (Harvard Forest); Barbara Lerner (Mount Holyoke College); Thomas Pasquier (Harvard University); Margo Seltzer (Harvard University)
  • Abstract:

    Many scientific journals now require that authors publish the data and scripts (if any) used to support their results. But that may not be sufficient to fully understand and replicate a data analysis. It may not be possible, for example, to understand or execute the original script because the documentation is inadequate or because compatible libraries are no longer available, or to replicate inputs that were generated at run time, such as data downloaded from the web. A solution to this problem lies in data provenance, the precise history of a digital artifact from the point of its creation to its present state. But because few (if any) workflow or scripting environments capture data provenance, provenance has so far had little impact on the transparency, reliability, and reproducibility of scientific results.

    In this project we are developing software tools to make data provenance available to users of the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. Our tools include RDataTracker, an R package that collects data provenance in the form of a Data Derivation Graph (or DDG), and DDG Explorer, a separate tool used to visualize and query the resulting DDG. These tools automatically capture data provenance as an R script executes and allow the user to query the resulting DDG in various ways, for example to see how a particular data value was derived or used, or which lines of source code correspond to a particular step. The ultimate goal of our project is to provide an end-to-end system that integrates provenance tools for R, Python, and operating systems in a common framework with common tools.
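
    To make the idea of a derivation graph concrete, the following is a minimal sketch in Python of a toy graph that links executed steps to the data they read and write, and answers the two kinds of query mentioned above (how a value was derived, and where it was used). It is an illustration only: the class `ToyDDG`, its methods, and the sample R statements are hypothetical and do not reflect the actual DDG format or the RDataTracker API.

```python
from collections import defaultdict


class ToyDDG:
    """Hypothetical toy derivation graph; not the RDataTracker data model."""

    def __init__(self):
        self.steps = {}                    # step id -> (source line, inputs, outputs)
        self.produced_by = {}              # data name -> id of the step that created it
        self.used_by = defaultdict(list)   # data name -> ids of the steps that read it

    def record(self, step_id, source, inputs, outputs):
        """Record one executed statement with the data it read and wrote."""
        self.steps[step_id] = (source, list(inputs), list(outputs))
        for name in inputs:
            self.used_by[name].append(step_id)
        for name in outputs:
            # Simplification: a reassigned name keeps only its latest producer.
            self.produced_by[name] = step_id

    def derivation(self, name):
        """Return the source lines that contributed to `name`.

        Assumes step ids increase in execution order, so sorting the ids
        yields the lines in the order they ran.
        """
        seen, stack = set(), [name]
        while stack:
            step_id = self.produced_by.get(stack.pop())
            if step_id is None or step_id in seen:
                continue
            seen.add(step_id)
            stack.extend(self.steps[step_id][1])  # follow this step's inputs
        return [self.steps[s][0] for s in sorted(seen)]


# Three statements from an imagined R script, recorded by hand.
ddg = ToyDDG()
ddg.record(1, 'temps <- read.csv("temps.csv")', [], ["temps"])
ddg.record(2, "mean.temp <- mean(temps$value)", ["temps"], ["mean.temp"])
ddg.record(3, "n <- nrow(temps)", ["temps"], ["n"])

print(ddg.derivation("mean.temp"))  # the read.csv line, then the mean line
print(ddg.used_by["temps"])         # [2, 3]
```

    In this sketch the provenance is entered by hand; the point of the tools described here is that the equivalent information is captured automatically while the script runs.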

    Recent improvements include:

    1. Users can now execute an R script one line at a time, or set breakpoints to pause execution, and view the resulting DDG. This feature provides strong support for script debugging because one can inspect all the data values and steps up to the point where execution was paused, removing the need to insert print statements and rerun the script. It also ensures that the user sees the correct intermediate values for that particular execution of the script.

    2. Data provenance is now captured for statements inside control constructs (if-else, for, while, and repeat). This feature greatly increases the information collected for many scripts. By examining the DDG, one can see at a glance which branch of an if-else statement was executed or how many times a loop was run.

    3. Information about sourced scripts and installed R packages (with version numbers) is now collected. This information is essential for trying to repeat a data analysis in an environment that closely matches the original environment.

    4. Data provenance is now collected for RMarkdown scripts. Annotations in the RMarkdown script are used to create expandable and collapsible sections in the DDG.

    All project software is available for download from GitHub (http://github.com/End-to-end-provenance) and major releases are included in dataset HF091.

  • Research Category: Ecological Informatics and Modelling