Harvard Forest REU Symposium Abstract 2011

  • Title: Capturing, persisting, and querying the provenance of scientific data
  • Author: Sofiya Taskova (Mount Holyoke College)
  • Abstract:

    Scientists routinely rely on technology to collect and process data. They often use software to handle massive datasets and produce scientific results, and they post those results on the web, making them readily available to the public. Flaws and differences in the way data are collected and processed can make the results harder to interpret. The ability to trace the provenance, or history, of any given result is essential for ensuring the authenticity and reproducibility of that result, as well as for improving the result by incorporating corrections into its processing. Data provenance is defined as the information describing all entities (procedures and data) that were involved in producing a result. We aim to create a software tool that provides provenance for scientific data analyses. It is essential that users be able to derive meaningful answers to interesting provenance questions with our software. We used a process definition written in the graphical programming language Little-JIL to generate a graph (a Data Derivation Graph, or DDG) documenting the provenance of the data for each process execution. We stored the DDG in an RDF (Resource Description Framework) database and made it available for querying. We are exploring whether our software collects and stores enough provenance to verify results and to serve as a foundation for reenacting processes. We are also examining whether our software supports useful queries and can display the results so that they are easily navigated.

  • Research Category: Ecological Informatics and Modelling
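
The approach described in the abstract, a DDG stored as subject-predicate-object triples and queried for the history of a result, can be illustrated with a minimal sketch. This is not the project's implementation: the `TripleStore` class is a small in-memory stand-in for an RDF database, and the predicate names (`ddg:used`, `ddg:wasGeneratedBy`) and node names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical predicate names; the abstract does not specify an RDF vocabulary.
USED = "ddg:used"                     # procedure -> data item it read
GENERATED_BY = "ddg:wasGeneratedBy"   # data item -> procedure that produced it


class TripleStore:
    """Tiny in-memory stand-in for an RDF database (assumption, not the real tool)."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Record one (subject, predicate, object) triple."""
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return all objects linked from `subject` by `predicate`."""
        pairs = self._by_subject.get(subject, ())
        return {o for p, o in pairs if p == predicate}


def provenance(store, node):
    """Return every procedure and data node that contributed to `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        parents = store.objects(current, GENERATED_BY) | store.objects(current, USED)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_example():
    """Build a small, made-up DDG: raw data -> cleaning -> averaging."""
    store = TripleStore()
    store.add("data:cleaned", GENERATED_BY, "proc:remove_outliers")
    store.add("proc:remove_outliers", USED, "data:raw_sensor")
    store.add("data:daily_mean", GENERATED_BY, "proc:average")
    store.add("proc:average", USED, "data:cleaned")
    store.add("data:unrelated", GENERATED_BY, "proc:other")
    return store


if __name__ == "__main__":
    ddg = build_example()
    for item in sorted(provenance(ddg, "data:daily_mean")):
        print(item)
```

Calling `provenance(ddg, "data:daily_mean")` walks the graph backwards from the result and collects both the procedures and the intermediate data it depends on, which is the kind of question ("what was involved in producing this result?") the abstract says the tool should answer.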