ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
Sheeba Samuel, Daniel Mietchen, Jungsan Kim, Waqas Ahmed, Martin Gaedke
TLDR
ReproScore is a new framework that separates software reproducibility readiness from execution outcome, improving assessment for digital libraries.
Key contributions
- Introduces ReproScore, a two-tier framework separating reproducibility readiness (RRS) from outcome (ROS).
- RRS uses 26 sub-metrics across five categories; ROS provides execution-based probes when sandboxes are available.
- Evaluated on 423 GitHub repositories, revealing that static readiness is a poor predictor of actual execution success.
- Validates the architectural separation as crucial for scalable, reproducibility-aware curation in digital libraries.
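The separation of readiness (RRS) from outcome (ROS), combined into a coverage-adaptive composite, can be sketched as below. This is a hypothetical illustration: the blend rule, function name, and parameters are assumptions for exposition, not ReproScore's actual formula.

```python
# Hypothetical sketch of a coverage-adaptive composite score (RCS).
# The linear blend below is an assumption for illustration only;
# ReproScore's actual weighting scheme may differ.

def composite_score(rrs, ros=None, coverage=0.0):
    """Blend a readiness score (RRS) with an outcome score (ROS).

    coverage: fraction of execution probes that could actually run
    (0.0 when no sandbox infrastructure is available). As coverage
    grows, execution evidence carries more weight; with no probes,
    the composite falls back to readiness alone.
    """
    if ros is None or coverage == 0.0:
        return rrs  # no execution evidence: readiness only
    return (1.0 - coverage) * rrs + coverage * ros

# Example: strong static readiness but failing execution probes.
print(composite_score(0.85, 0.20, coverage=0.5))  # 0.525
```

A rule like this would reproduce the paper's headline finding at the extremes: with zero probe coverage the composite reduces to the (weakly predictive) static readiness score, and execution evidence only dominates once probes actually run.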
Why it matters
The paper addresses a persistent gap in assessing research software reproducibility by explicitly separating static readiness from actual execution outcomes. The resulting framework, ReproScore, scales to large repository collections and can improve curation in digital libraries, supporting the long-term usability and verifiability of scientific software.
Original Abstract
Digital libraries curate millions of research software artefacts yet lack scalable infrastructure for assessing whether those artefacts remain executable. Existing automated assessment tools treat static repository completeness -- what a repository contains -- as a proxy for execution success -- whether it runs. We term this the readiness-outcome conflation and present ReproScore, a two-tier framework that explicitly separates reproducibility readiness (RRS) from reproducibility outcome (ROS), combining them into a coverage-adaptive Composite Score (RCS). RRS comprises 26 sub-metrics across five categories; ROS provides execution-based probes when sandbox infrastructure is available; a community rubric externalises weighting priorities as versioned YAML profiles. Evaluated on 423 GitHub repositories from a large-scale ground-truth corpus spanning five failure modes, two complementary findings emerge: the environment category strongly discriminates failure mode, confirming static signals capture meaningful structural differences; yet RRS exhibits near-zero binary success correlation, empirically quantifying the readiness-outcome gap at repository scale. Together, these findings validate the architectural separation as both necessary and non-trivial, positioning ReproScore as scalable infrastructure for reproducibility-aware curation in digital library workflows.
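The abstract's community rubric, which externalises weighting priorities as versioned YAML profiles, might look roughly like the fragment below. Only the `environment` category name comes from the paper; every other key and value here is an illustrative assumption, not the project's actual schema.

```yaml
# Hypothetical rubric profile sketch -- only the "environment"
# category name appears in the paper; all other keys are illustrative.
profile: community-default
version: "1.0.0"
category_weights:
  environment: 0.30   # one of the five RRS categories; the study found
                      # it strongly discriminates failure mode
  # ... the four remaining categories (not named in this summary)
  # would share the other 0.70
```

Versioning the profile lets a digital library pin the exact weighting used for a given curation pass, so scores remain comparable even as community priorities evolve.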