Harvard Forest REU Symposium Abstract 2009

  • Title: A Software Engineering Approach to Scientific Data Provenance
  • Author: Cory L TesheraSterne (Mount Holyoke College)
  • Abstract:

    Advanced technology has enabled scientists to design elaborate
    experiments, collecting ecological data from extensive,
    sophisticated sensor networks. In turn, these experiments produce
    far larger, more complex datasets than previously encountered.
    The requirements of funding agencies and collaborations with
    remote scientists lead ecological data producers such as Harvard
    Forest to make datasets publicly available over the Internet. In
    order to be useful to data consumers, scientific datasets need to be
    reliable and reproducible. Both require that the process used to
    produce the data be transparent: consumers need access to data
    provenance, that is, accurate information about the datasets and
    how they were produced. Collecting this information and presenting
    it in a useful way is a complex research problem that can benefit
    from approaches originally developed for software engineering.

    Computer scientists at the University of Massachusetts, Amherst
    have developed Little-JIL, a graphical programming language
    capable of organizing tools used to collect, analyze, and manage
    scientific data. One application is a proposed network of sensors
    at Harvard Forest that will measure stream flow, precipitation,
    and other hydrological data, with the goal of gaining a more
    complete understanding of the water budget of small forested
    watersheds.
    Research conducted this summer has produced Little-JIL
    processes that automate processing and quality control of data
    collected from these sensors and that capture information
    about the process itself as it runs. As this project moves forward,
    this metadata will be stored in a way that will permit any individual
    data value to be traced backward through the process to its origin.
    The resulting software will give data consumers full knowledge of
    how scientific datasets were produced.
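
    The idea of tracing a data value backward to its origin can be
    illustrated with a minimal sketch. It is written in Python rather
    than Little-JIL and is not the project's implementation. The
    names DataValue, apply_step, trace, and origins, the step labels,
    and the sample readings are all hypothetical and exist only for
    illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataValue:
    """A value plus a record of how it was produced (hypothetical structure)."""
    name: str
    value: float
    step: str = "sensor reading"  # process step that produced this value
    inputs: tuple = ()            # DataValues the step consumed; empty for raw data


def apply_step(step, name, func, *inputs):
    """Run one processing step and store its inputs alongside the result."""
    result = func(*(v.value for v in inputs))
    return DataValue(name, result, step, tuple(inputs))


def trace(value, depth=0):
    """Walk backward from a value, returning one indented line per ancestor."""
    lines = [f"{'  ' * depth}{value.name} = {value.value} (from: {value.step})"]
    for parent in value.inputs:
        lines.extend(trace(parent, depth + 1))
    return lines


def origins(value):
    """Return the raw values that a derived value ultimately depends on."""
    if not value.inputs:
        return [value]
    return [o for parent in value.inputs for o in origins(parent)]


# Made-up readings and made-up processing steps, for illustration only.
raw_a = DataValue("stage_0900", 0.5)
raw_b = DataValue("stage_0915", 1.5)
mean = apply_step("interval mean", "stage_mean",
                  lambda a, b: (a + b) / 2, raw_a, raw_b)
flow = apply_step("scaling", "stream_flow", lambda s: 2.0 * s, mean)

print("\n".join(trace(flow)))
```

    Because each derived value keeps references to its inputs, the
    record of the process travels with the data, and calling
    origins(flow) recovers the two raw readings it came from.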

  • Research Category: Ecological Informatics and Modelling; Watershed Ecology