Harvard Forest REU Symposium Abstract 2019

  • Title: provExplainR: why does my R script return different results?
  • Author: Khanh H Ngo (Mount Holyoke College)
  • Abstract:

    One common way for scientists to increase the transparency of their work is to archive their data and scripts, which contain all the analytical and computational steps needed to yield the final results. However, this information alone might not be enough to reproduce a result when scripts are rerun on different occasions or shared with collaborators, for example because of differences in hardware and software. One solution is to collect data provenance, which is the record of all elements that contribute to a piece of data, including its intermediate values, operational dependencies, and computing environment. In this project, we support reproducibility by helping scientists find differences between two provenance collections using a package we built called provExplainR. The package inspects provenance collected by rdtLite, reports changes in the provenance, and then offers suggestions or explanations for why the results differ. The factors examined include the hardware and software used to execute the script, versions of attached libraries, use of global variables, modified inputs and outputs, and changes in main and sourced scripts. Based on the detected changes, our tool can be used to study how these factors affect the behavior of the script and to generate a plausible diagnosis of the causes of differing script results. This in turn should help scientists reproduce a scientific analysis and provide a reward for those who collect data provenance.

  • Research Category: Ecological Informatics and Modelling