Summer Research Project 2017

  • Title: Group Project: Data Provenance in R
  • Group Project Leader: Emery Boose
  • Mentors: Emery Boose; Matthew Lau; Barbara Lerner
  • Collaborators: Emery Boose; Aaron Ellison; Elizabeth Fong; Matthew Lau; Barbara Lerner; Margo Seltzer
  • Project Description:

    The ability to understand and replicate a data analysis is enhanced by metadata that describe exactly how the data were created and transformed, including all of the data artifacts and processes used along the way. However, few (if any) workflow or scripting environments currently available capture all of this information (also known as data provenance). Rather, most software used for data analysis is optimized for performance and ease of use. As a result, data provenance has had little impact so far in improving the transparency, reliability, and reproducibility of scientific results.

    Two major challenges must be overcome to bring data provenance within reach of domain scientists. First, the software tools that collect data provenance must be easy to use; ideally the analytical tools that scientists already use should be augmented to provide this service. This is a non-trivial task, since much of the required information is dynamic and must be collected and recorded while the script or program is executing. Second, data provenance (once collected) has the potential to be very large and complex. As a result, scientists will need effective tools for visualizing, querying, and managing these metadata; without such tools, the provenance will have no practical value.

    In this project we are developing software tools to make data provenance available to users of the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. Our current tools include: (1) RDataTracker, a library of R functions that collects data provenance in the form of a Data Derivation Graph (or DDG) as an R script executes, and (2) DDG Explorer, a separate tool written in Java and used to visualize, query, and store the resulting DDG. Both tools have been designed for ease of use and for generating and visualizing large DDGs. However, many challenges remain before they will be fully useful to scientists.
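
    To make the idea of a Data Derivation Graph concrete, the following is a minimal sketch in Python, not RDataTracker's actual API or data format. It assumes a DDG can be approximated as procedure nodes and data nodes joined by directed edges; the `DDG` class, `record_step`, and the file name `temps.csv` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DDG:
    """Toy data derivation graph (an illustrative assumption, not the real DDG format)."""
    nodes: dict = field(default_factory=dict)  # node name -> "data" or "procedure"
    edges: list = field(default_factory=list)  # (source, target) pairs

    def record_step(self, procedure, inputs, outputs):
        """Record one executed step as: inputs -> procedure -> outputs."""
        self.nodes[procedure] = "procedure"
        for name in inputs:
            self.nodes.setdefault(name, "data")
            self.edges.append((name, procedure))
        for name in outputs:
            self.nodes[name] = "data"
            self.edges.append((procedure, name))


def run_example():
    """Run a tiny analysis and record its provenance as each step executes."""
    ddg = DDG()

    # Stand-in for reading a hypothetical file; the values are inline here.
    raw = [12.1, 13.4, None, 11.8]
    ddg.record_step("read_temps", inputs=["temps.csv"], outputs=["raw"])

    clean = [t for t in raw if t is not None]
    ddg.record_step("drop_missing", inputs=["raw"], outputs=["clean"])

    mean = sum(clean) / len(clean)
    ddg.record_step("compute_mean", inputs=["clean"], outputs=["mean"])

    return ddg, mean
```

    Calling `run_example()` returns the graph together with the computed mean, so the result and the record of how it was derived travel together. In this sketch the recording calls are written by hand; collecting the same information automatically while a script runs is the harder problem the project addresses.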

    Next summer a team of students and mentors will work on two closely related projects that will build on the efforts of earlier REU students. Both projects will involve regular interaction with other scientists and students at Harvard Forest to evaluate emerging tools for creating and using data provenance.

    (1) Visualizing and querying DDGs. The size and complexity of DDGs may render them incomprehensible in the absence of good support tools. The first project will increase the power of our visualization tools by developing features that make it easier for scientists to visualize, query, and compare DDGs. Special attention will be paid to developing a user interface that works for scientists.
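
    As a rough illustration of what querying and comparing DDGs could mean, the Python sketch below treats a DDG as a plain list of (source, target) edges. This representation, the function names `ancestors` and `compare`, and the sample runs are assumptions made for the example; they do not describe how DDG Explorer works.

```python
def ancestors(edges, target):
    """Return every node that `target` was derived from, directly or indirectly."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare(edges_a, edges_b):
    """Return the edges that appear in only one of the two graphs."""
    a, b = set(edges_a), set(edges_b)
    return {"only_in_a": a - b, "only_in_b": b - a}


# Two hypothetical runs of a script that differ only in the final step.
RUN_1 = [
    ("temps.csv", "read_temps"), ("read_temps", "raw"),
    ("raw", "drop_missing"), ("drop_missing", "clean"),
    ("clean", "compute_mean"), ("compute_mean", "mean"),
]
RUN_2 = RUN_1[:4] + [("clean", "compute_median"), ("compute_median", "median")]
```

    Here `ancestors(RUN_1, "clean")` answers the lineage question "what was this value derived from?", and `compare(RUN_1, RUN_2)` isolates the step that changed between runs. On a large DDG, results like these would still need a usable interface to be of practical help.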

    (2) Exploring provenance tracking as a debugging tool. Our current tools provide some support for script debugging, including the ability to set breakpoints, view the DDG as a script executes, and capture R warnings and error messages in the DDG. However, the potential uses of these tools for script development and debugging are largely unexplored and will be the focus of the second project.
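
    The idea of capturing warnings and error messages alongside provenance can be sketched as follows. This is a Python analogy using the standard `warnings` module, not the mechanism RDataTracker uses for R; the `run_step` wrapper, the log format, and `safe_mean` are hypothetical.

```python
import warnings


def run_step(log, name, func, *args):
    """Run one step and append its warnings and any error to a provenance log."""
    entry = {"step": name, "warnings": [], "error": None}
    result = None
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, including repeats
        try:
            result = func(*args)
        except Exception as exc:
            entry["error"] = f"{type(exc).__name__}: {exc}"
        entry["warnings"] = [str(w.message) for w in caught]
    log.append(entry)
    return result


def safe_mean(values):
    """Mean that warns when it has to drop missing values."""
    if None in values:
        warnings.warn("missing values dropped")
        values = [v for v in values if v is not None]
    return sum(values) / len(values)
```

    With a log built this way, a failed step and the warnings that preceded it sit next to the record of which data each step used, which is the kind of context that could help when tracing a bad result back to its cause.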

    Students will spend approximately one half-day per week in the field collecting meteorological and hydrological data and assisting with equipment maintenance. Students will also assist other REU students in the use of R to analyze their data.

    Desired skills:
    Students must have strong software development skills and have experience with (or willingness to learn) R.
    Interests in database and querying technologies, in exploring techniques for analyzing and visualizing data, and in applying software development expertise to address scientific problems are highly desirable.

  • Readings:

    Software & Documentation:

    http://github.com/End-to-end-provenance

    Readings:

    Lerner, B. S. and Boose, E. R. 2014. RDataTracker: Collecting provenance in an interactive scripting environment. 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP), Cologne, Germany.

    Lerner, B. S. and Boose, E. R. 2014. RDataTracker and DDG Explorer: Capture, visualization and querying of provenance from R scripts. International Provenance and Annotation Workshop (IPAW), Cologne, Germany.

    Ellison, A. M. 2010. Repeatability and transparency in ecological research. Ecology 91: 2536-2539.

    Boose, E. R., Ellison, A. M., Osterweil, L. J., Podorozhny, R., Clarke, L., Wise, A., Hadley, J. L., Foster, D. R. 2007. Ensuring Reliable Datasets for Environmental Models and Forecasts. Ecological Informatics 2: 237-247.

    Ellison, A. M., Osterweil, L. J., Hadley, J. L., Wise, A., Boose, E. R., Clarke, L., Foster, D. R., Hanson, A., Jensen, D., Kuzeja, P. S., Riseman, E., Schultz, H. 2006. Analytic Webs Support the Synthesis of Ecological Data Sets. Ecology 87: 1345-1358.

  • Research Category: Watershed Ecology, Group Projects, Ecological Informatics and Modelling