Harvard Forest Symposium Abstract 2015

Title: Collecting and Visualizing Data Provenance in R
Primary Author: Emery Boose (Harvard Forest)
Additional Authors: Aaron Ellison (Independent); Barbara Lerner (Mount Holyoke College); Lee Osterweil (University of Massachusetts - Amherst)
Abstract:
The ability to understand and replicate a data analysis is enhanced by metadata that describe exactly how the data were created and transformed, including all of the data artifacts and processes used along the way. However, few (if any) workflow or scripting environments currently available capture all of this information (also known as data provenance). Rather, most software used for data analysis is optimized for performance and ease of use and not for tracking provenance. As a result, data provenance has had little impact so far in improving the transparency, reliability, and reproducibility of scientific results.

Two major challenges must be overcome to bring data provenance within reach of domain scientists. First, the software tools that collect data provenance must be easy to use; ideally the analytical tools that scientists already use should be augmented to provide this service. This is a non-trivial task, since much of the required information is dynamic and must be collected and recorded while the script or program is executing. Second, data provenance (once collected) has the potential to be very large and complex. As a result, scientists will need effective tools for visualizing, querying, and managing these metadata; without such tools, the metadata will have no practical value.

In this project we are developing two software tools to make data provenance available to users of the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. The first tool is RDataTracker, a special library of R functions (written in R) that collects data provenance in the form of a Data Derivation Graph (or DDG) as an R script executes. The second tool is DDG Explorer, a stand-alone program (written in Java) used to visualize, query, and store DDGs. Current versions of these tools and associated documentation are available on the HF website as dataset HF091.

Over the next year, as we continue to develop these tools and gather feedback from scientists at Harvard Forest and elsewhere, we plan to (1) refine the features and user interface of DDG Explorer to make it easier for scientists to understand and make use of data provenance, and (2) explore how data provenance might be used as a debugging tool to assist scientists in developing R scripts.
Research Category: Ecological Informatics and Modelling