Harvard Forest Symposium Abstract 2018

  • Title: Collecting and Using Data Provenance in R
  • Primary Author: Emery Boose (Harvard Forest)
  • Additional Authors: Aaron Ellison (Harvard Forest); Elizabeth Fong (Mount Holyoke College); Matthew Lau (Harvard Forest); Barbara Lerner (Mount Holyoke College); Thomas Pasquier (Harvard University); Margo Seltzer (University of British Columbia)
  • Abstract:

    The software tools that scientists use to process and analyze data are typically optimized for performance and ease of use. Few if any such tools are designed to capture and record the details of what happens as the tool performs its magic. This detailed information, and more generally the history of an item of data from its creation to its present state, is known as data provenance. It is our belief that data provenance has great potential to make science more transparent, reliable, and reproducible.

    In our work to date we have focused on collecting data provenance for scripts written in the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. The resulting tools (RDataTracker and provR) are reaching maturity and allow the user to execute an R script, set the level of provenance detail to be collected, and store the provenance in a standard format. A separate tool (DDG Explorer) allows the user to visualize and query the provenance.

    Our experience with users to date suggests that few if any scientists are interested in working with provenance directly, even if it might improve their understanding of their own scripts or the scripts of others, but they might adopt tools that use provenance if those tools perform useful functions. So our future efforts will turn to developing such applications. Examples may include: cleaning a script to remove non-essential elements, identifying all occurrences of a variable for quality control or error propagation, finding which parts of a script require the most computation time, and improving script debugging through access to intermediate data values.

    Project software is available on GitHub. For related publications see dataset HF091.

  • Research Category: Ecological Informatics and Modelling
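
One of the applications named in the abstract, cleaning a script to remove non-essential elements, can be sketched as a backward walk over provenance records. The sketch below is written in Python rather than R and rests on stated assumptions: the record layout (`line`, `code`, `reads`, `writes`), the sample steps, and the function name `essential_lines` are all hypothetical and do not reflect the format or API of RDataTracker, provR, or DDG Explorer.

```python
from collections import deque

# Hypothetical provenance records: each script step lists the variables it
# reads and the single variable it writes. Illustration only; this is not
# the format produced by RDataTracker or provR.
STEPS = [
    {"line": 1, "code": 'raw <- read.csv("plots.csv")', "reads": [], "writes": "raw"},
    {"line": 2, "code": "clean <- na.omit(raw)", "reads": ["raw"], "writes": "clean"},
    {"line": 3, "code": "n <- nrow(raw)", "reads": ["raw"], "writes": "n"},
    {"line": 4, "code": "fit <- lm(height ~ age, data = clean)", "reads": ["clean"], "writes": "fit"},
    {"line": 5, "code": "tmp <- summary(raw)", "reads": ["raw"], "writes": "tmp"},
]


def essential_lines(steps, target):
    """Return the sorted line numbers of the steps needed to compute `target`."""
    # Simplifying assumption: each variable is assigned exactly once.
    producer = {step["writes"]: step for step in steps}
    needed = set()
    queue = deque([target])
    while queue:
        var = queue.popleft()
        step = producer.get(var)
        if step is None or step["line"] in needed:
            continue  # unknown variable, or step already collected
        needed.add(step["line"])
        queue.extend(step["reads"])  # follow the inputs backward
    return sorted(needed)


if __name__ == "__main__":
    keep = essential_lines(STEPS, "fit")
    for step in STEPS:
        if step["line"] in keep:
            print(step["code"])
```

Under these assumptions, the steps that only produce `n` and `tmp` are dropped when the target is `fit`. A real script would also need to handle reassigned variables, side effects such as file writes, and control flow, none of which this sketch attempts.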