Proxics: an efficient programming model for far memory accelerators
Zikai Liu, Niels Pressel, Jasmin Schult, Roman Meier, Pengcheng Xu, et al.
TLDR
Proxics introduces an efficient programming model for Near-Data Processing (NDP) accelerators using lightweight OS abstractions: virtual processors (processes) and IPC channels.
Key contributions
- Introduces Proxics, an efficient programming model for Near-Data Processing (NDP) accelerators.
- Leverages familiar OS abstractions, virtual processors (processes) and IPC channels, for NDP.
- Implements these abstractions efficiently for lightweight NDP hardware via compilation.
- Demonstrates performance benefits on real hardware for diverse memory-intensive apps.
Why it matters
This paper addresses the critical need for portable OS abstractions to program emerging Near-Data Processing (NDP) accelerators. Proxics offers an efficient programming model, making these complex systems more accessible for developers. It also underscores the often-neglected importance of low-latency communication between CPU and NDP devices.
Original Abstract
The use of disaggregated or far memory systems such as CXL memory pools has renewed interest in Near-Data Processing (NDP): situating cores close to memory to reduce bandwidth requirements to and from the CPU. Hardware designs for such accelerators are appearing, but clean, portable OS abstractions for programming them are lacking. We propose a programming model for NDP devices based on familiar OS abstractions: virtual processors (processes) and inter-process communication channels (like Unix pipes). While appealing from a user perspective, a naive implementation of such abstractions is inappropriate for NDP accelerators: the paucity of processing power in some hardware designs makes classical processes overly heavyweight, and IPC based on shared buffers makes no sense in a system designed to reduce memory bandwidth. Accordingly, we show how to implement these abstractions in a lightweight and efficient manner by exploiting compilation and interconnect protocols. We demonstrate them with a real hardware platform running applications with a range of memory access patterns, including bulk memory operations, in-memory databases and graph applications. Crucially, we show not only the benefits over CPU-only implementations, but also the critical importance of efficient, low-latency communication channels between CPU and NDP accelerators, a feature largely neglected in existing proposals.
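To make the "processes plus pipe-like channels" abstraction concrete, here is a minimal host-side sketch using ordinary Unix processes and pipes. It is not the Proxics API (which the summary does not show); the `offload_sum` name and the use of a forked child to stand in for an NDP core are illustrative assumptions, showing only the familiar model the paper builds on: a host streams work to a worker near the data and receives a small reduced result back, rather than pulling bulk data across to the CPU.

```python
import os

def offload_sum(values):
    """Parent streams integers to a child 'near-data' worker over a pipe;
    the child reduces them and sends back only the sum.
    The child is a stand-in for an NDP core; the real Proxics API differs."""
    to_child_r, to_child_w = os.pipe()    # host -> worker channel
    to_parent_r, to_parent_w = os.pipe()  # worker -> host channel
    pid = os.fork()
    if pid == 0:
        # Child: plays the role of a core sitting close to memory.
        os.close(to_child_w)
        os.close(to_parent_r)
        data = b""
        while True:  # read until the host closes its end (EOF)
            chunk = os.read(to_child_r, 4096)
            if not chunk:
                break
            data += chunk
        total = sum(int(x) for x in data.split())  # reduce near the data
        os.write(to_parent_w, str(total).encode())
        os._exit(0)
    # Parent: the host CPU sends work and waits for the small result.
    os.close(to_child_r)
    os.close(to_parent_w)
    os.write(to_child_w, " ".join(map(str, values)).encode())
    os.close(to_child_w)  # signal EOF to the worker
    result = int(os.read(to_parent_r, 64).decode())
    os.waitpid(pid, 0)
    return result

print(offload_sum([1, 2, 3, 4]))  # prints 10
```

The abstract's point about IPC maps onto this sketch: a shared-buffer implementation of these channels would reintroduce the memory traffic NDP is meant to avoid, which is why the paper implements them via compilation and interconnect protocols instead.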