Harvard Forest Research Project 2024

  • Title: Collecting and Using Provenance in R
  • Principal investigator: Emery Boose (boose@fas.harvard.edu)
  • Institution: Harvard Forest
  • Primary contact: Emery Boose (boose@fas.harvard.edu)
  • Team members: Aaron Ellison, Barbara Lerner, Margo Seltzer
  • Abstract:

    The software tools that scientists use to process and analyze data are typically optimized for performance and ease of use. Few if any such tools are designed to capture and record the details of what happens as the tool performs its task(s). This detailed information, and more generally the history of an item of data from its creation to its present state, is known as provenance. It is our belief that provenance has great potential to make science more transparent, reliable, and reproducible.

    In this project we have developed tools to collect provenance for the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. These tools allow users to select the level of detail to be collected, execute an R script or enter commands in a console session, and store the resulting provenance in a standard format. The provenance provides a detailed record of the steps that were executed and the intermediate data values that were created in a particular execution of a script.
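
    As a rough illustration of the workflow described above, the sketch below collects provenance for a small script with rdtLite. It assumes rdtLite is installed from CRAN and that `prov.run()` accepts the `prov.dir`, `details`, and `snapshot.size` arguments as described in the package documentation; the script contents and file name are hypothetical.

```r
# Assumes the rdtLite package is installed: install.packages("rdtLite")
library(rdtLite)

# Hypothetical two-line analysis script, written to a temporary file.
script <- file.path(tempdir(), "example.R")
writeLines(c("x <- c(2, 4, 6)", "y <- mean(x)"), script)

# Run the script while collecting provenance.
# details = TRUE asks for fine-grained provenance; snapshot.size
# (assumed to be in kilobytes) caps how much of each intermediate
# data value is saved. The provenance is written under prov.dir.
prov.run(script, prov.dir = tempdir(), details = TRUE, snapshot.size = 10)

# Retrieve the provenance of the most recent run as a JSON string.
prov <- prov.json()
```

    For a console session, the package documentation describes a similar pattern in which `prov.init()` starts collection and `prov.quit()` ends it.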

    Recent work has focused on developing applications that use provenance to support scientists in their work. Tools currently available on CRAN include: rdtLite (collect provenance), provGraphR (create adjacency matrix), provParseR (extract stored provenance), provSummarizeR (summarize provenance), provViz (visualize provenance), provDebugR (debug scripts), provExplainR (explain why results differ), and provTraceR (trace file lineage).

    The project website and software are available on GitHub (http://github.com/End-to-end-provenance).