Summer Research Project 2018

  • Title: Group Project: Data Provenance in R
  • Summer Supervisors: Emery Boose; Barbara Lerner
  • Researchers: Emery Boose; Aaron Ellison; Elizabeth Fong; Matthew Lau; Barbara Lerner; Margo Seltzer
  • Project Description:

    The ability to understand and replicate a data analysis is enhanced by metadata that describe exactly how the data were transformed, including intermediate values created and steps performed in the course of the analysis. However, few (if any) workflow and scripting environments are designed to capture this information (also known as data provenance). As a result, data provenance has had little impact so far on improving the transparency, reliability, and reproducibility of scientific results.

    In this project we are developing software tools to make data provenance available to users of the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. Our current tools include RDataTracker, an R library that executes an R script, collects data provenance, and saves it in a standard format (PROV-JSON). A second tool, DDG Explorer, allows the scientist to visualize and query the resulting data provenance.
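
    To give a sense of what provenance in this kind of format looks like, here is a minimal sketch in Python (chosen only for illustration; the project's tools target R). It builds by hand a tiny PROV-JSON-style document for a hypothetical two-step analysis. The file names, node identifiers, and the `ex` prefix are all assumptions, and the result is not meant to reproduce what RDataTracker actually writes.

```python
import json


def build_provenance():
    """Return a small PROV-JSON-style document as a Python dict.

    Everything here is hypothetical: a raw file is cleaned, and the
    cleaned data are then plotted.
    """
    return {
        "prefix": {"ex": "http://example.org/"},
        # Entities are data items: files and intermediate values.
        "entity": {
            "ex:raw.csv": {},
            "ex:clean_data": {},
            "ex:plot.png": {},
        },
        # Activities are the steps that were executed.
        "activity": {
            "ex:clean": {},
            "ex:plot": {},
        },
        # "used" records which data each step read.
        "used": {
            "_:u1": {"prov:activity": "ex:clean", "prov:entity": "ex:raw.csv"},
            "_:u2": {"prov:activity": "ex:plot", "prov:entity": "ex:clean_data"},
        },
        # "wasGeneratedBy" records which step produced each data item.
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": "ex:clean_data", "prov:activity": "ex:clean"},
            "_:g2": {"prov:entity": "ex:plot.png", "prov:activity": "ex:plot"},
        },
    }


doc = build_provenance()
# Serialize to JSON text; sort_keys keeps the output stable between runs.
text = json.dumps(doc, indent=2, sort_keys=True)
```

    The string in `text` could be written to a file and later read back with `json.loads` for inspection or querying.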

    The next step in our project will be the development of software tools that build on data provenance and further the work of the scientist. Possible applications include: (1) script debugging (troubleshooting a script using details on intermediate data values and steps executed), (2) reproducibility (producing the same result by identifying and reusing critical inputs), and (3) quality control (investigating questionable data values by tracing the execution path from a given input or to a given output).
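
    The quality-control idea of tracing an execution path back from a given output can be sketched as a simple graph traversal. The example below is in Python and uses a small, made-up dependency table; the node names and the `trace_back` and `original_inputs` helpers are hypothetical and are not part of RDataTracker or DDG Explorer.

```python
from collections import deque

# Hypothetical provenance edges: each key depends on the nodes in its list.
# An output depends on the step that produced it; a step depends on its inputs.
DEPENDS_ON = {
    "plot.png": ["plot"],
    "plot": ["clean_data"],
    "clean_data": ["clean"],
    "clean": ["raw.csv", "thresholds.csv"],
    "summary.txt": ["summarize"],
    "summarize": ["raw.csv"],
}


def trace_back(node, depends_on):
    """Return every node that `node` depends on, directly or indirectly,
    in breadth-first order. The starting node itself is excluded."""
    found = []
    visited = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in depends_on.get(current, []):
            if parent not in visited:
                visited.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


def original_inputs(node, depends_on):
    """Return the ancestors of `node` that have no dependencies of their own,
    i.e. the inputs the analysis started from."""
    return [n for n in trace_back(node, depends_on) if not depends_on.get(n)]


lineage = trace_back("plot.png", DEPENDS_ON)
inputs = original_inputs("plot.png", DEPENDS_ON)
```

    Here `lineage` lists the steps and data behind `plot.png`, and `inputs` narrows that list to the original files, which are the ones to inspect first when a plotted value looks questionable. Tracing forward from a given input would use the same traversal over the reversed edges.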

    Next summer a team of students and mentors will work on developing such applications and gauging their effectiveness in collaboration with other scientists and students at Harvard Forest. Students will work closely with mentors Lerner and Boose on a daily basis and will participate in weekly video conferences with the larger group. Students will also spend a couple of hours each week in the field collecting meteorological and hydrological data and will assist other REU students in the use of R to analyze their data.

    Desired Skills: Students must have strong software engineering skills and experience with (or a willingness to learn) R.

  • Readings:

    Pasquier, T., Lau, M., Trisovic, A., Boose, E. R., Couturier, B., Crosas, M., Ellison, A., Gibson, V., Jones, C. R., Seltzer, M. 2017. If these data could talk. Scientific Data 4:170114.

    Lerner, B. S. and Boose, E. R. 2014. RDataTracker: Collecting provenance in an interactive scripting environment. 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP), Cologne, Germany.

    Lerner, B. S. and Boose, E. R. 2014. RDataTracker and DDG Explorer: Capture, visualization and querying of provenance from R scripts. International Provenance and Annotation Workshop (IPAW), Cologne, Germany.

    Ellison, A. M. 2010. Repeatability and transparency in ecological research. Ecology 91:2536-2539.

  • Research Category: Group Projects, Ecological Informatics and Modelling