ReproScore: Separating Readiness from Outcome in Research Software Reproducibility Assessment
Sheeba Samuel, Daniel Mietchen, Jungsan Kim, Waqas Ahmed, Martin Gaedke
TLDR
ReproScore is a new framework that separates software reproducibility readiness from execution outcome, improving assessment for digital libraries.
Key contributions
- Introduces ReproScore, a two-tier framework separating reproducibility readiness (RRS) from outcome (ROS).
- RRS uses 26 sub-metrics across five categories; ROS provides execution-based probes when sandboxes are available.
- Evaluated on 423 GitHub repositories, revealing that static readiness is a poor predictor of actual execution success.
- Validates the architectural separation as crucial for scalable, reproducibility-aware curation in digital libraries.
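The separation of readiness (RRS) from outcome (ROS), combined into a coverage-adaptive composite, can be sketched as below. This is a hypothetical illustration: the blend rule, function name, and parameters are assumptions for exposition, not ReproScore's actual formula.

```python
# Hypothetical sketch of a coverage-adaptive composite score (RCS).
# The linear blend below is an assumption for illustration only;
# ReproScore's actual weighting scheme may differ.

def composite_score(rrs, ros=None, coverage=0.0):
    """Blend a readiness score (RRS) with an outcome score (ROS).

    coverage: fraction of execution probes that could actually run
    (0.0 when no sandbox infrastructure is available). As coverage
    grows, execution evidence carries more weight; with no probes,
    the composite falls back to readiness alone.
    """
    if ros is None or coverage == 0.0:
        return rrs  # no execution evidence: readiness only
    return (1.0 - coverage) * rrs + coverage * ros

# Example: strong static readiness but failing execution probes.
print(composite_score(0.85, 0.20, coverage=0.5))  # 0.525
```

A rule like this would reproduce the paper's headline finding at the extremes: with zero probe coverage the composite reduces to the (weakly predictive) static readiness score, and execution evidence only dominates once probes actually run.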
Why it matters
The paper addresses a persistent gap in assessing research software reproducibility by explicitly separating static readiness from actual execution outcomes. The resulting framework, ReproScore, scales to large repository collections and can improve curation in digital libraries, supporting the long-term usability and verifiability of scientific software.
Original Abstract
Digital libraries curate millions of research software artefacts yet lack scalable infrastructure for assessing whether those artefacts remain executable. Existing automated assessment tools treat static repository completeness -- what a repository contains -- as a proxy for execution success -- whether it runs. We term this the readiness-outcome conflation and present ReproScore, a two-tier framework that explicitly separates reproducibility readiness (RRS) from reproducibility outcome (ROS), combining them into a coverage-adaptive Composite Score (RCS). RRS comprises 26 sub-metrics across five categories; ROS provides execution-based probes when sandbox infrastructure is available; a community rubric externalises weighting priorities as versioned YAML profiles. Evaluated on 423 GitHub repositories from a large-scale ground-truth corpus spanning five failure modes, two complementary findings emerge: the environment category strongly discriminates failure mode, confirming static signals capture meaningful structural differences; yet RRS exhibits near-zero binary success correlation, empirically quantifying the readiness-outcome gap at repository scale. Together, these findings validate the architectural separation as both necessary and non-trivial, positioning ReproScore as scalable infrastructure for reproducibility-aware curation in digital library workflows.
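The abstract's community rubric, which externalises weighting priorities as versioned YAML profiles, might look roughly like the fragment below. Only the `environment` category name comes from the paper; every other key and value here is an illustrative assumption, not the project's actual schema.

```yaml
# Hypothetical rubric profile sketch -- only the "environment"
# category name appears in the paper; all other keys are illustrative.
profile: community-default
version: "1.0.0"
category_weights:
  environment: 0.30   # one of the five RRS categories; the study found
                      # it strongly discriminates failure mode
  # ... the four remaining categories (not named in this summary)
  # would share the other 0.70
```

Versioning the profile lets a digital library pin the exact weighting used for a given curation pass, so scores remain comparable even as community priorities evolve.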