This long-term project brings together computer scientists and ecologists to investigate a critical problem in science: how to ensure that scientific data analyses are reproducible. The solution appears to lie in the use of “provenance metadata” to document rigorously how data are transformed at each step of an analysis, from start to finish. In our current formulation, this provenance metadata takes the form of two mathematical graphs: a process definition graph (PDG), which specifies the various ways in which a process might unfold, and a data derivation graph (DDG), which describes exactly how a process did unfold in a particular execution.
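To make the DDG idea concrete, the following is a minimal Python sketch, not the project's actual representation. It assumes a DDG can be modeled as a directed graph whose nodes are either data items or process steps, with edges pointing from each node back to what it was derived from. The class name `DDG`, the method names, and the hydrological step and data names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class DDG:
    """Toy data derivation graph (hypothetical model, for illustration only).

    Nodes are data items or process steps; each node records the nodes
    it was directly derived from in one particular execution.
    """

    def __init__(self):
        self.kind = {}                   # node name -> "data" or "step"
        self.parents = defaultdict(set)  # node name -> direct predecessors

    def add_step(self, step, inputs, outputs):
        """Record one executed step with the data it read and produced."""
        self.kind[step] = "step"
        for item in inputs:
            self.kind.setdefault(item, "data")
            self.parents[step].add(item)
        for item in outputs:
            self.kind[item] = "data"
            self.parents[item].add(step)

    def lineage(self, node):
        """Return every step and data item that `node` depends on."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical two-step execution: calibrate a raw sensor reading,
# then convert the calibrated stage to discharge.
ddg = DDG()
ddg.add_step("calibrate", inputs=["raw_stage"], outputs=["stage_m"])
ddg.add_step("rating_curve", inputs=["stage_m", "curve_params"], outputs=["discharge"])
```

Calling `ddg.lineage("discharge")` walks back through both steps to the raw inputs, which is the kind of question a DDG is meant to answer about a particular run. A PDG would sit one level above this, describing the steps that could run rather than the ones that did.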
These abstract concepts from computer science are tested through application to an ongoing project in a domain science, currently the analysis of streaming data from a hydrological sensor network. Recent efforts have focused on defining and executing this analysis in Little-JIL (a high-level graphical process language) and on creating a DDG in memory as the process executes. Work for 2011 will focus on creating a persistent form of the DDG (using database technologies) and on developing methods for querying and analyzing DDGs.
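As one illustration of what a persistent, queryable DDG might look like, the sketch below stores derivation edges in SQLite through Python's standard `sqlite3` module and answers a lineage question with a recursive query. It is an assumption-laden sketch: the project's actual database technology and schema are not specified here, and the `edge` table, the function names, and the step and data names are all hypothetical.

```python
import sqlite3


def open_ddg_store(path=":memory:"):
    """Open (or create) a store holding one row per derivation edge.

    The single-table schema is a hypothetical simplification.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edge ("
        " src TEXT NOT NULL,"
        " dst TEXT NOT NULL,"
        " PRIMARY KEY (src, dst))"
    )
    return conn


def record_step(conn, step, inputs, outputs):
    """Persist edges input -> step and step -> output for one executed step."""
    rows = [(item, step) for item in inputs] + [(step, item) for item in outputs]
    conn.executemany("INSERT OR IGNORE INTO edge (src, dst) VALUES (?, ?)", rows)
    conn.commit()


def ancestors(conn, node):
    """Return, sorted by name, everything `node` was derived from."""
    sql = """
        WITH RECURSIVE up(name) AS (
            SELECT src FROM edge WHERE dst = ?
            UNION
            SELECT edge.src FROM edge JOIN up ON edge.dst = up.name
        )
        SELECT name FROM up ORDER BY name
    """
    return [row[0] for row in conn.execute(sql, (node,))]


# Hypothetical execution trace recorded as the process runs.
conn = open_ddg_store()
record_step(conn, "calibrate", ["raw_stage"], ["stage_m"])
record_step(conn, "rating_curve", ["stage_m", "curve_params"], ["discharge"])
print(ancestors(conn, "discharge"))
```

Passing a file path instead of `":memory:"` would keep the graph across runs. The recursive query uses `UNION` rather than `UNION ALL` so that a node reachable by several paths is reported only once.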