A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC
Petter Sandås, Íñigo Aréjula-Aísa, Sergio Iserte, Antonio J. Peña
TLDR
This paper introduces a test taxonomy and CI ecosystem for robust dynamic resource management in HPC, improving fault detection and maintenance.
Key contributions
- Introduces a test taxonomy for MPI malleable libraries, structuring functional and non-functional tests.
- Develops an HPC-oriented continuous integration (CI) ecosystem for dynamic resource management.
- Instantiates the taxonomy in a containerized virtual cluster for automated validation.
- Evaluates the approach using the Dynamic Management of Resources (DMR) framework as a representative case study.
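To make the taxonomy concrete, the sketch below tags each test along two axes taken from the abstract: functional versus non-functional, and component-integration versus system level. It is only an illustration. The class names, the helper `group_by_category`, and the example test names are hypothetical and do not come from the paper or from DMR.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FUNCTIONAL = "functional"
    NON_FUNCTIONAL = "non-functional"


class Level(Enum):
    COMPONENT_INTEGRATION = "component-integration"
    SYSTEM = "system"


@dataclass(frozen=True)
class TaxonomyEntry:
    """One test, tagged with its position in the two-axis taxonomy."""
    name: str
    kind: Kind
    level: Level


def group_by_category(entries):
    """Group test names by their (kind, level) pair."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry.kind, entry.level)].append(entry.name)
    return dict(groups)


# Hypothetical test names, chosen only to populate the categories.
EXAMPLE_ENTRIES = [
    TaxonomyEntry("expand_then_shrink", Kind.FUNCTIONAL, Level.COMPONENT_INTEGRATION),
    TaxonomyEntry("job_resize_end_to_end", Kind.FUNCTIONAL, Level.SYSTEM),
    TaxonomyEntry("reconfiguration_overhead", Kind.NON_FUNCTIONAL, Level.SYSTEM),
]

if __name__ == "__main__":
    for (kind, level), names in group_by_category(EXAMPLE_ENTRIES).items():
        print(f"{kind.value} / {level.value}: {names}")
```

Grouping by category in this way would let a CI pipeline select, for example, only the component-integration tests on every commit and leave system-level runs for a slower schedule, though the paper's actual scheduling policy is not stated in this summary.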
Why it matters
This methodology improves early fault detection and simplifies maintenance for dynamic resource management in HPC. It also offers a transferable solution for validating other malleability frameworks.
Original Abstract
High-performance computing (HPC) systems are increasingly exploring dynamic resource management and malleable MPI applications to better adapt to heterogeneous architectures, fluctuating workloads, and energy constraints. However, the correctness of the libraries that support these techniques is often evaluated through ad hoc experiments that can be difficult to reproduce and maintain. This article introduces a methodology for testing dynamic resource management frameworks that combines a taxonomy of tests for MPI malleable libraries with an HPC-oriented continuous integration (CI) ecosystem. The taxonomy structures functional and non-functional tests at both component-integration and system levels. The CI ecosystem instantiates this taxonomy in a containerized virtual cluster, enabling automated validation. The approach is instantiated and evaluated using the Dynamic Management of Resources (DMR) framework as a representative case study. Results show that the proposed methodology improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.
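The transferability claim rests on frameworks exposing three analogous primitives: initialization, readiness checking, and reconfiguration. One way to picture this is a test written against an abstract interface rather than a specific library. The sketch below is a minimal illustration under that assumption. `MalleabilityBackend`, `FakeBackend`, and `check_reconfiguration` are hypothetical names, the fake backend runs in memory with no MPI, and none of this reflects the actual DMR API.

```python
from typing import Protocol


class MalleabilityBackend(Protocol):
    """Hypothetical interface covering the three primitives named in the abstract."""

    def initialize(self, num_procs: int) -> None: ...
    def is_ready(self) -> bool: ...
    def reconfigure(self, new_num_procs: int) -> int: ...


class FakeBackend:
    """In-memory stand-in for a malleability library (assumption: no real MPI)."""

    def __init__(self, max_procs: int) -> None:
        self.max_procs = max_procs
        self.num_procs = 0

    def initialize(self, num_procs: int) -> None:
        if not 1 <= num_procs <= self.max_procs:
            raise ValueError("initial process count out of range")
        self.num_procs = num_procs

    def is_ready(self) -> bool:
        return self.num_procs > 0

    def reconfigure(self, new_num_procs: int) -> int:
        if not self.is_ready():
            raise RuntimeError("reconfigure called before initialize")
        # Clamp the request to what the fake cluster can offer.
        self.num_procs = max(1, min(new_num_procs, self.max_procs))
        return self.num_procs


def check_reconfiguration(backend: MalleabilityBackend,
                          start: int, target: int, expected: int) -> bool:
    """Backend-agnostic functional check: initialize, confirm readiness, resize."""
    backend.initialize(start)
    if not backend.is_ready():
        return False
    return backend.reconfigure(target) == expected


if __name__ == "__main__":
    print(check_reconfiguration(FakeBackend(max_procs=8), 2, 4, 4))
    print(check_reconfiguration(FakeBackend(max_procs=8), 2, 16, 8))
```

Because `check_reconfiguration` depends only on the three-method interface, the same check could in principle be reused with an adapter around another malleability library, which is the kind of reuse the abstract describes.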