Harvard Forest Symposium Abstract 2014

Title: Collecting and Visualizing Data Provenance in R
Primary Author: Emery Boose (Harvard Forest)
Additional Authors: Aaron Ellison (Independent); Barbara Lerner (Mount Holyoke College); Lee Osterweil (University of Massachusetts - Amherst)
Abstract:
Scientific data provenance is the information required to accurately document the history of an item of data, including how it was created and how it was transformed. Data provenance supports replication and validation of data and improves both understanding and sharing of data and results. However, data provenance is not widely used by scientists, for a variety of reasons. The standard software tools used by scientists to analyze their data do not (for the most part) collect provenance, while specialized tools that do collect it (such as workflows) often require a significant investment of time to learn. In addition, the study of data provenance has largely remained the focus of computer scientists, who regard it as an interesting and challenging problem. To date there has been little input from domain scientists, who, we believe, could provide critical insights into what information to collect and how to manage and visualize it.

In this project we are developing software tools to support the collection of data provenance in R, a software environment for data analysis and visualization widely used by environmental scientists, in such a way that a minimum of extra effort is required on the part of the scientist. Our approach involves the use of two tools: (1) The R script is instrumented (annotated) by the scientist with calls to RDataTracker, a special library of R functions, which collects data provenance as the script executes. The provenance takes the form of a mathematical graph (with nodes and edges) that we call a DDG or Data Derivation Graph. (2) DDG Explorer, a separate Java program, may then be used to view and query the provenance graph using a graphical interface and to store it in an RDF database for future use and more extensive querying.
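
To make the idea of a provenance graph concrete, the following is a minimal sketch of a graph of data and procedure nodes joined by directed edges, with a query that walks back through the history of a data item. It is written in Python purely for illustration and does not reproduce the RDataTracker API: the DDG class, its add_data, add_procedure and ancestors methods, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DDG:
    """Hypothetical Data Derivation Graph: named nodes plus directed edges."""
    nodes: dict = field(default_factory=dict)  # node name -> "data" or "procedure"
    edges: list = field(default_factory=list)  # (source, target) pairs

    def add_data(self, name):
        """Record a data item that enters the analysis from outside."""
        self.nodes[name] = "data"

    def add_procedure(self, name, inputs, outputs):
        """Record a processing step and link it to the data it reads and writes."""
        self.nodes[name] = "procedure"
        for item in inputs:
            self.nodes.setdefault(item, "data")
            self.edges.append((item, name))
        for item in outputs:
            self.nodes[item] = "data"
            self.edges.append((name, item))

    def ancestors(self, name):
        """Return every node that `name` was derived from, directly or indirectly."""
        found = set()
        stack = [name]
        while stack:
            current = stack.pop()
            for source, target in self.edges:
                if target == current and source not in found:
                    found.add(source)
                    stack.append(source)
        return found


# The "script": each analysis step is followed by a call that records it.
ddg = DDG()

raw = [12.1, 13.4, None, 11.8]
ddg.add_data("raw_temps")

clean = [t for t in raw if t is not None]
ddg.add_procedure("drop_missing", inputs=["raw_temps"], outputs=["clean_temps"])

mean_temp = sum(clean) / len(clean)
ddg.add_procedure("compute_mean", inputs=["clean_temps"], outputs=["mean_temp"])

# Query: which data and steps does mean_temp depend on?
lineage = ddg.ancestors("mean_temp")
```

Here the lineage of mean_temp contains both procedure nodes and both upstream data nodes, which is the kind of question a provenance viewer is meant to answer.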

Initial interactions with scientists at the Harvard Forest have led to a series of improvements to this approach, including: (1) the ability to capture provenance from interactive console sessions as well as from the execution of preexisting scripts, (2) the ability to set checkpoints, which save the R state and associated data files, and restore points, which restore them, allowing users to return to an earlier point in the execution of a script, and (3) the ability to capture R run-time errors when a script fails and incorporate such errors into the resulting data provenance, allowing users to examine execution up to the point of failure and facilitating troubleshooting.
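
Improvements (2) and (3) can be illustrated with a short sketch, again in Python and under stated assumptions rather than as a description of how the tools are implemented. The function run_with_provenance, the log format, and the step names are hypothetical, and the checkpoint here copies only in-memory state, whereas the tools described above also save associated data files.

```python
import copy


def run_with_provenance(steps, state):
    """Run (name, function) steps in order and record each one in a log.

    If a step raises, the error is recorded as the final log entry and
    execution stops, so the log covers everything up to the failure.
    """
    log = []
    for name, func in steps:
        try:
            func(state)
            log.append({"step": name, "status": "ok"})
        except Exception as exc:
            log.append({
                "step": name,
                "status": "error",
                "message": f"{type(exc).__name__}: {exc}",
            })
            break
    return log


def add_reading(s):
    s["temps"].append(16.0)


def bad_mean(s):
    # Fails on purpose: the state has no "count" entry.
    s["mean"] = sum(s["temps"]) / s["count"]


def mark_done(s):
    s["done"] = True


state = {"temps": [12.0, 14.0]}

# Checkpoint: keep an independent copy of the state before running.
checkpoint = copy.deepcopy(state)

log = run_with_provenance(
    [("add_reading", add_reading), ("bad_mean", bad_mean), ("mark_done", mark_done)],
    state,
)

# Restore: discard the partially modified state and go back to the checkpoint.
state = copy.deepcopy(checkpoint)
```

The log ends with the failing step and its error message, the step after it is never run, and restoring the checkpoint returns the state to what it was before the script started.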

We plan to continue efforts to refine the structure and content of the provenance graph and to make the tools described above easier for scientists to use. We also anticipate that when scientists have ready access to provenance they will develop new and creative uses for provenance that we cannot now foresee.
Research Category: Ecological Informatics and Modelling
Group Projects