Harvard Forest REU Symposium Abstract 2010

  • Title: Software support for capturing digital data provenance
  • Author: Sofiya Taskova (Mount Holyoke College)
  • Abstract:

    Scientists perform complex analyses on massive data sets to address research questions. Insufficient documentation of the manipulations used to derive the desired quantities from raw data may compromise confidence in the results.

    Scientists can increase the reliability and acceptance of their results by providing metadata that describes how the data was collected and processed, such as information on the equipment used, the time and location of the collection and manipulation of the data, computations applied to the data, and identification of when modeled values were substituted for measured values.

    To be consistent and complete, data provenance must be captured during the processing of the data of interest.

    We worked with hydrologists at Harvard Forest who are measuring precipitation, evapotranspiration and stream discharge to study the role of streams and wetlands in the ecosystem. Their concern with data provenance is motivated in part by the need to recalibrate the sensors that produce the raw data. Automated collection of provenance information is essential for identifying the data items affected by a recalibration, and hence for the reliability of the hydrological data.

    We propose to document data provenance using a mathematical graph structure. We introduce two graphs to represent the provenance of digital data. The Process Definition Graph (PDG) defines the possible ways in which data can be processed. The Dataset Derivation Graph (DDG) describes how a concrete piece of data was actually processed.

    We are working toward making the provenance data collected by the software accessible to scientists via queries and visualization. Putting the software into practice will inform our future work toward an optimal provenance-capturing architecture.

  • Research Category: Group Projects
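
As an illustration of the graph idea in the abstract, the following is a minimal sketch of how a Dataset Derivation Graph might be stored and queried to find the data items affected by a sensor recalibration. It is not the project's actual software: the `Node` and `DerivationGraph` classes, the `downstream_data` method, and all node names are hypothetical, and the graph is assumed to be a simple directed graph whose edges point from inputs to the steps and data they produce. The Process Definition Graph is not modeled here.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A node in the graph: either a data item or a processing step (hypothetical)."""
    name: str
    kind: str  # "data" or "step"


class DerivationGraph:
    """Directed graph with edges pointing from an input to what it feeds into."""

    def __init__(self):
        self._successors = defaultdict(set)

    def add_edge(self, source, target):
        self._successors[source].add(target)

    def downstream_data(self, start):
        """Return every data node reachable from `start`, found breadth-first."""
        seen = {start}
        queue = deque([start])
        affected = set()
        while queue:
            node = queue.popleft()
            for nxt in self._successors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if nxt.kind == "data":
                        affected.add(nxt)
        return affected


# Illustrative derivation chain; the names are assumptions, not real datasets.
raw = Node("raw_stage_height", "data")
calibrate = Node("apply_calibration", "step")
calibrated = Node("calibrated_stage_height", "data")
compute = Node("compute_discharge", "step")
discharge = Node("stream_discharge", "data")
precip = Node("precipitation", "data")  # not derived from the raw sensor data

ddg = DerivationGraph()
ddg.add_edge(raw, calibrate)
ddg.add_edge(calibrate, calibrated)
ddg.add_edge(calibrated, compute)
ddg.add_edge(compute, discharge)

# Data items that would need review if the sensor behind `raw` is recalibrated.
affected = ddg.downstream_data(raw)
print(sorted(n.name for n in affected))
```

Pointing edges from inputs to outputs keeps the recalibration question a plain forward traversal. Answering the opposite question, how a given value was derived, would need reverse edges or a second index.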