Summer Research Project 2019

  • Title: Group Project: The Fruits of Provenance
  • Group Project Leader: Emery Boose
  • Mentors: Emery Boose; Barbara Lerner
  • Collaborators: Emery Boose; Aaron Ellison; Barbara Lerner; Margo Seltzer
  • Project Description:

    The software tools that scientists use to process and analyze data are typically optimized for performance and ease of use. Few, if any, are designed to record the details of what happens as the tool performs its task. This detailed information, and more generally the history of an item of data from its creation to its present state, is known as provenance. We believe that provenance has great potential to make science more transparent, reliable, and reproducible.

    In this project we have developed tools to collect provenance for the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. These tools allow users to select the level of detail to be collected, execute an R script, and store the resulting provenance in a standard format. The provenance provides a detailed record of the steps that were executed and the intermediate data values that were created in a particular execution of a script.

    Our efforts now focus on developing applications that use provenance to perform tasks that support scientists in their work. Promising applications to date include a "script cleaner" that removes everything in a script not required to produce a particular result, and a "time-traveling debugger" that supports debugging of a script without the need to set breakpoints or insert print statements and rerun the script.
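
    The "script cleaner" idea can be illustrated with a minimal sketch. It is not the project's implementation: it assumes the provenance is available as an ordered list of executed steps, each recording the variables it read and wrote. The `Step` record, the `clean_script` function, and the sample statements are all hypothetical. Walking the steps backward from the desired result keeps only the statements that result depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One executed statement plus the variables it read and wrote (hypothetical record)."""
    line: int
    code: str
    reads: frozenset
    writes: frozenset


def clean_script(steps, target):
    """Return only the steps needed to produce `target`, in original order."""
    needed = {target}  # variables whose producing steps are still required
    kept = []
    for step in reversed(steps):
        if step.writes & needed:
            kept.append(step)
            # The step supplies the variables it writes; its inputs become required.
            needed = (needed - step.writes) | step.reads
    return list(reversed(kept))


# Hypothetical provenance for a five-line analysis script.
steps = [
    Step(1, "raw = load('plots.csv')", frozenset(), frozenset({"raw"})),
    Step(2, "tmp = summary(raw)", frozenset({"raw"}), frozenset({"tmp"})),
    Step(3, "clean = drop_na(raw)", frozenset({"raw"}), frozenset({"clean"})),
    Step(4, "print(tmp)", frozenset({"tmp"}), frozenset()),
    Step(5, "fit = model(clean)", frozenset({"clean"}), frozenset({"fit"})),
]

cleaned = clean_script(steps, "fit")
for step in cleaned:
    print(step.code)
```

    In this example only lines 1, 3, and 5 are kept, because `tmp` and the `print` call do not contribute to `fit`.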

    Next summer a team of students and mentors will work on developing applications that use provenance (from R, Python, or other languages) and on gauging their effectiveness in collaboration with other scientists and students at Harvard Forest. Possible applications include (1) dynamic analysis of scripts for common problems (for example, unintended type conversion in R), (2) capture and display of provenance for R Markdown, which allows users to combine code with a supporting narrative, and (3) comparisons of provenance graphs to see how and why script results differ. Students will work closely with mentors Lerner and Boose on a daily basis and will participate in weekly video conferences with the larger group. Students will also assist other REU students in the use of R to analyze their data.
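
    As a rough illustration of application (1), the sketch below scans recorded provenance for variables whose type changes during a run. It assumes, purely for illustration, that each provenance record is a `(line, variable, type)` tuple; the record layout, the `find_type_changes` function, and the sample data are hypothetical and are not taken from the project's tools.

```python
def find_type_changes(records):
    """Report every point where a variable's recorded type differs from its previous type.

    `records` is an ordered list of (line, variable, type_name) tuples.
    Returns a list of (line, variable, old_type, new_type) tuples.
    """
    last_type = {}
    changes = []
    for line, name, type_name in records:
        previous = last_type.get(name)
        if previous is not None and previous != type_name:
            changes.append((line, name, previous, type_name))
        last_type[name] = type_name
    return changes


# Hypothetical records: `dbh` starts as numeric and silently becomes character.
records = [
    (1, "dbh", "numeric"),
    (2, "plot", "character"),
    (3, "dbh", "character"),
    (4, "dbh", "character"),
]

changes = find_type_changes(records)
for line, name, old, new in changes:
    print(f"line {line}: {name} changed from {old} to {new}")
```

    A real tool would also need to decide which changes are intentional; this sketch simply reports every change it sees.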

    Desired Skills: Students must have strong software engineering skills and experience with (or a willingness to learn) R or Python.

  • Readings:

    Software:

    http://end-to-end-provenance.github.io/

    Readings:

    Barbara Lerner, Emery Boose, and Luis Perez. 2018. Using Introspection to Collect Provenance in R. Informatics, 5, 12.

    Thomas Pasquier, Matthew Lau, Xueyuan Han, Elizabeth Fong, Barbara Lerner, Emery Boose, Mercè Crosas, Aaron Ellison, and Margo Seltzer. 2018. Sharing and Preserving Computational Analyses for Posterity with encapsulator. IEEE Computing in Science and Engineering (CiSE).

    Thomas Pasquier, Matthew K. Lau, Ana Trisovic, Emery R. Boose, Ben Couturier, Mercè Crosas, Aaron M. Ellison, Valerie Gibson, Chris R. Jones, and Margo Seltzer. 2017. If these data could talk. Nature Scientific Data, 4.

    Marcia McNutt, Kerstin Lehnert, Brooks Hanson, Brian A. Nosek, Aaron M. Ellison, and John Leslie King. 2016. Liberating field science samples and data. Science, 351 (6277).

    Ellison, A. M. 2010. Repeatability and transparency in ecological research. Ecology 91: 2536-2539.

  • Research Category: Group Projects, Ecological Informatics and Modelling