Harvard Forest Symposium Abstract 2016

  • Title: Collecting and Visualizing Data Provenance in R
  • Primary Author: Emery Boose (Harvard Forest)
  • Additional Authors: Aaron Ellison (Harvard University); Barbara Lerner (Mount Holyoke College); Margo Seltzer (University of British Columbia)
  • Abstract:

    The ability to understand and replicate a data analysis is enhanced by metadata that describe exactly how the data were created and transformed. However, few (if any) of the workflow or scripting environments currently available capture this information (also known as data provenance). As a result, data provenance has so far done little to improve the transparency, reliability, and reproducibility of scientific results.



    In this project we are developing two software tools to make data provenance available to users of the R statistical language. The first tool is RDataTracker, a special library of R functions (written in R) that collects data provenance in the form of a Data Derivation Graph (or DDG) as an R script executes. The second tool is DDG Explorer, a stand-alone program (written in Java) used to visualize, query, and store DDGs. Current versions of these tools and associated documentation are available on the HF website as dataset HF091.
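    To make the idea of a Data Derivation Graph concrete, the following is a minimal sketch of a graph with procedure nodes (operations) and data nodes (values), plus a simple lineage query of the kind a DDG viewer might offer. It is written in Python purely for illustration; the class `DDG`, its methods, and the example step names are hypothetical and do not reflect the actual API or internals of RDataTracker or DDG Explorer.

```python
from collections import defaultdict


class DDG:
    """Toy data derivation graph (hypothetical, not RDataTracker's structure)."""

    def __init__(self):
        self.nodes = {}                   # node id -> "procedure" or "data"
        self.parents = defaultdict(set)   # node id -> ids it directly depends on

    def record(self, operation, inputs, outputs):
        """Record one operation, the data it read, and the data it produced."""
        self.nodes[operation] = "procedure"
        for name in inputs:
            # Inputs not seen before (e.g. input files) become data nodes.
            self.nodes.setdefault(name, "data")
            self.parents[operation].add(name)
        for name in outputs:
            self.nodes[name] = "data"
            self.parents[name].add(operation)

    def lineage(self, node_id):
        """Return every node that node_id was derived from, directly or not."""
        seen, stack = set(), [node_id]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical three-step analysis: read a file, clean it, summarize it.
ddg = DDG()
ddg.record("read.csv", inputs=["temps.csv"], outputs=["raw"])
ddg.record("clean", inputs=["raw"], outputs=["cleaned"])
ddg.record("mean", inputs=["cleaned"], outputs=["avg"])

print(sorted(ddg.lineage("avg")))
```

    Asking for the lineage of `avg` walks back through every operation and data value that contributed to it, which is the basic question data provenance is meant to answer.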



    Recent improvements to RDataTracker add the ability to (1) capture provenance for function input and return values, for statements inside functions, and for sourced scripts, (2) associate procedure nodes in the DDG with line numbers in the R script (and sourced scripts, if any), (3) automatically create data nodes for input files, objects taken from the initial environment, and plots not written to file, (4) capture warnings in the DDG, (5) record elapsed time for each operation, and (6) save the DDG in PROV-JSON format. DDG Explorer was improved by adding the ability to (1) display elapsed time for individual operations and for start-finish blocks, and (2) search for nodes in the DDG by type and name.
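    As a rough illustration of several of the capabilities listed above (script line numbers on procedure nodes, data nodes for objects from the initial environment, elapsed time per operation, and a JSON export), here is a small Python sketch. The function `run_with_provenance`, the `ex:` attribute names, and the example steps are assumptions made for this example. The output only follows the general PROV-JSON layout of `entity`, `activity`, `used`, and `wasGeneratedBy` sections; it is not the schema that RDataTracker actually writes.

```python
import json
import time


def run_with_provenance(steps, initial):
    """Run steps of (name, line, func, input_names, output_name).

    Returns (values, doc), where doc is a PROV-JSON-style dictionary.
    All "ex:" attribute names are hypothetical.
    """
    values = dict(initial)
    doc = {
        "prefix": {"ex": "http://example.org/"},
        # Objects from the initial environment get data nodes automatically.
        "entity": {f"ex:{name}": {"ex:source": "initial environment"}
                   for name in initial},
        "activity": {},
        "used": {},
        "wasGeneratedBy": {},
    }
    for index, (name, line, func, input_names, output_name) in enumerate(steps, 1):
        activity_id = f"ex:p{index}"
        started = time.perf_counter()
        result = func(*(values[n] for n in input_names))
        elapsed = time.perf_counter() - started
        values[output_name] = result

        # Procedure node: operation name, script line, elapsed seconds.
        doc["activity"][activity_id] = {
            "ex:name": name,
            "ex:scriptLine": line,
            "ex:elapsedTime": elapsed,
        }
        # Edges from each input data node to this procedure node.
        for n in input_names:
            edge_id = "_:u%d" % (len(doc["used"]) + 1)
            doc["used"][edge_id] = {"prov:activity": activity_id,
                                    "prov:entity": f"ex:{n}"}
        # Data node for the output, and the edge that generated it.
        doc["entity"][f"ex:{output_name}"] = {"ex:value": repr(result)}
        edge_id = "_:g%d" % (len(doc["wasGeneratedBy"]) + 1)
        doc["wasGeneratedBy"][edge_id] = {"prov:entity": f"ex:{output_name}",
                                          "prov:activity": activity_id}
    return values, doc


# Hypothetical two-line script: double a vector, then sum it.
initial = {"xs": [1, 2, 3]}
steps = [
    ("double", 2, lambda xs: [2 * x for x in xs], ["xs"], "doubled"),
    ("sum", 3, sum, ["doubled"], "total"),
]
values, doc = run_with_provenance(steps, initial)
print(json.dumps(doc, indent=2))
```

    Keeping the elapsed time and line number as attributes of each procedure node means a viewer can show them next to the operation without re-running the script.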



    Plans for the coming year include (1) incorporating some of the visualization and querying capabilities of DDG Explorer directly into RStudio, and (2) exploring how RDataTracker might be used to support the development and debugging of R scripts.

  • Research Category: Ecological Informatics and Modelling

  • Figures:
  • data derivation graph.gif