
2009 Harvard Forest REU Student Symposium Abstracts

Author:
Cory Teshera-Sterne - Mount Holyoke College

Title:
A Software Engineering Approach to Scientific Data Provenance

Abstract:

Advanced technology has enabled scientists to design elaborate
experiments, collecting ecological data from extensive,
sophisticated sensor networks. In turn, these experiments produce
far larger, more complex datasets than previously encountered.
The requirements of funding agencies and collaborations with
remote scientists lead ecological data producers such as Harvard
Forest to make datasets publicly available over the Internet. In
order to be useful to data consumers, scientific datasets need to be
reliable and reproducible. Both require that the process used to
produce the data be transparent: consumers need access to data
provenance, that is, accurate information about the datasets and how
they were produced. Collecting this information and presenting it in a
useful way is a complex research problem that can benefit from
approaches originally developed for software engineering.

Computer scientists at the University of Massachusetts, Amherst
have developed Little-JIL, a graphical programming language
capable of organizing tools used to collect, analyze, and manage
scientific data. One application is a proposed network of sensors at
Harvard Forest that will measure stream flow, precipitation, and other
hydrological data with the goal of gaining a more complete
understanding of the water budget of small forested watersheds.
Research conducted this summer has produced Little-JIL processes
that automate processing and quality control of data collected from
these sensors and that capture information about the process itself
as it runs. As this project moves forward,
this metadata will be stored in a way that will permit any individual
data value to be traced backward through the process to its origin.
The resulting software will give data consumers access to the full
provenance of the scientific datasets they use.
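
The idea of tracing a data value backward to its origin can be
illustrated with a minimal sketch. This is not Little-JIL and not the
project's implementation; it is a plain Python illustration, and every
name in it (ProvenanceRecord, raw_reading, range_check, mean, trace,
the sensor id, and the range limits) is hypothetical. Each derived
value keeps a reference to the step that produced it and to the values
it was derived from, so its lineage can be walked backward.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """A value plus the step and input records that produced it."""
    value: Optional[float]
    step: str            # name of the processing step (hypothetical)
    inputs: tuple = ()   # records this value was derived from


def raw_reading(value, sensor_id):
    """Origin of the chain: a value as reported by a sensor."""
    return ProvenanceRecord(value, f"sensor:{sensor_id}")


def range_check(record, low, high):
    """Quality-control step: out-of-range values become None."""
    ok = record.value is not None and low <= record.value <= high
    return ProvenanceRecord(
        record.value if ok else None,
        f"range_check[{low},{high}]",
        (record,),
    )


def mean(records):
    """Aggregate step: average of the values that passed quality control."""
    values = [r.value for r in records if r.value is not None]
    result = sum(values) / len(values) if values else None
    return ProvenanceRecord(result, "mean", tuple(records))


def trace(record, depth=0):
    """Walk backward from a value to its origins, one line per step."""
    lines = [f"{'  ' * depth}{record.step} -> {record.value}"]
    for parent in record.inputs:
        lines.extend(trace(parent, depth + 1))
    return lines


# Hypothetical readings; the third is deliberately out of range.
readings = [raw_reading(v, "stream_gauge_1") for v in (1.0, 2.0, 99.0)]
checked = [range_check(r, 0.0, 10.0) for r in readings]
daily = mean(checked)
print("\n".join(trace(daily)))
```

In this sketch the printed trace lists the mean, then each quality
control result, then the raw sensor reading behind it, so a consumer
can see that the out-of-range reading was excluded and why.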

