Harvard Forest REU Symposium Abstract 2010

  • Title: Improving Provenance Capture Using Examples from Hydrology
  • Author: Morgan A Vigil (Westmont College)
  • Abstract:

    Advanced sensor technology has drastically increased the amount of data researchers can collect in a short time span. Scientific results derived from these sensor data must be reproducible and reliable, yet tools and protocols that manage the data with reproducibility and reliability in mind are lacking. One solution is data provenance: a record of the processes and tools used to refine raw data into information. Such a record lets a user retrospectively trace a piece of information back through each manipulation that produced it, lending credence to the result. By making the data refinement process transparent, software designed to collect provenance metadata can help data consumers trust results derived from sensor data.

    Little-JIL, a visual programming language developed at the University of Massachusetts, Amherst, facilitates the collection of such provenance metadata by decomposing a process into individual steps. This discretization makes obvious where data are manipulated, which in turn makes those manipulations easier to capture. A process built from the characteristic steps of data read-in, verification, and manipulation was applied to an ecological example: collecting hydrology data from various sensors (and other sources) around Harvard Forest to understand the water budget of a forest watershed. Continuing research begun last summer, this summer's work improved the Little-JIL process as a provenance-collecting tool by adding several data collection steps, designing a GUI for users, and abstracting the process to allow for multiple types of gauges. Future research seeks to continue this development and to address questions that may arise about the provenance of the data, such as the types of equations used, the intermediate measures made for individual gauges, and how sensor drift was handled for a particular data set.

  • Research Category: Ecological Informatics and Modelling; Watershed Ecology
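
The abstract's idea of step-wise provenance capture can be illustrated with a minimal sketch. This is plain Python, not Little-JIL, and it does not reproduce the project's actual process. Every name in it (`ProvenanceRecord`, `Process`, `run_step`, the three step functions, the sample readings, and the validation and unit-conversion rules) is hypothetical and chosen only for illustration. It shows how splitting a process into read-in, verification, and manipulation steps gives a natural point at which to record what each step received and produced.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProvenanceRecord:
    """One entry in the provenance log (hypothetical structure)."""
    step: str       # name of the process step
    inputs: list    # values the step received
    outputs: list   # values the step produced


@dataclass
class Process:
    """Runs steps one at a time and logs each one's inputs and outputs."""
    log: List[ProvenanceRecord] = field(default_factory=list)

    def run_step(self, name: str, func: Callable[[list], list], data: list) -> list:
        result = func(data)
        # Record the manipulation at the step boundary, where it is visible.
        self.log.append(ProvenanceRecord(name, list(data), list(result)))
        return result


def read_in(raw: list) -> list:
    # Parse raw text readings into numbers.
    return [float(x) for x in raw]


def verify(values: list) -> list:
    # Assumed rule for illustration: discard negative readings.
    return [v for v in values if v >= 0.0]


def manipulate(values: list) -> list:
    # Assumed conversion for illustration: centimetres to metres.
    return [v / 100.0 for v in values]


process = Process()
data = ["12.0", "-1.0", "30.0"]  # made-up sample readings
for name, func in [("read_in", read_in), ("verify", verify), ("manipulate", manipulate)]:
    data = process.run_step(name, func, data)

# Trace the final values back through every step that produced them.
for record in process.log:
    print(record.step, record.inputs, "->", record.outputs)
```

Because each step logs its own inputs and outputs, a question such as "which readings were dropped during verification?" can be answered from the log alone, without re-running the process.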