Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
TLDR
Tempus is a novel temporal GEMM framework for AMD Versal AI Edge SoCs, enabling scalable and efficient LLM inference on resource-constrained edge devices.
Key contributions
- A fixed block of 16 AIE-ML cores scales GEMM through iterative graph execution and data tiling in the Programmable Logic (a minimal tiling sketch follows this list).
- Achieves 607 GOPS at 10.677 W with high-speed cascade streaming and deadlock-free dataflow.
- Outperforms the leading spatial SOTA (ARIES) by 211.2x in the Platform-Aware Utility (PAU) prominence factor while offering significant resource frugality.
- Enables sustainable and scalable LLM inference on edge devices by minimizing resource utilization.
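To make the first contribution concrete, here is a minimal host-side sketch of temporal scaling: a fixed-size tile GEMM is invoked iteratively over every tile of a larger problem, so the compute footprint stays constant as the matrices grow. The tile sizes and the `run_tile_gemm` function are hypothetical placeholders (the tile kernel is emulated on the CPU here to keep the sketch self-contained), not Tempus's actual API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tile sizes matched to a fixed compute block (illustrative
// values only; the real tiling is chosen per platform and datatype).
constexpr std::size_t TM = 64, TN = 64, TK = 64;

// Placeholder for one TMxTK * TKxTN tile GEMM on the fixed 16-core AIE-ML
// block; emulated on the CPU here so the sketch compiles on its own.
void run_tile_gemm(const float* A, const float* B, float* C,
                   std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < TM; ++i)
        for (std::size_t k = 0; k < TK; ++k)
            for (std::size_t j = 0; j < TN; ++j)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

// Temporal scaling: the same fixed-size block is invoked iteratively over
// every (row, column, depth) tile, accumulating partial sums in C, so the
// hardware footprint stays constant while M, N, K grow.
// Assumes M, N, K are multiples of TM, TN, TK and C is zero-initialized.
void temporal_gemm(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C,
                   std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; i += TM)
        for (std::size_t j = 0; j < N; j += TN)
            for (std::size_t k = 0; k < K; k += TK)
                run_tile_gemm(&A[i * K + k], &B[k * N + j], &C[i * N + j],
                              K, N, N);
}
```

In Tempus itself, the inner call corresponds to the fixed 16-core AIE-ML graph, and the tiling, replication, and partial-sum accumulation are handled in the Programmable Logic rather than on a host CPU.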
Why it matters
Edge deployment of LLMs is challenging because of strict compute, memory, and power constraints under which existing spatial-scaling solutions often fail. Tempus addresses this with a resource-invariant, temporally scalable GEMM framework, enabling efficient and sustainable LLM inference on edge AI hardware and paving the way for broader adoption.
Original Abstract
Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores -- an approach that fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores, achieving scalability through iterative graph execution and algorithmic data tiling and replication in the Programmable Logic. High-speed cascade streaming ensures low-latency partial sum reduction at Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we prove that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA (ARIES). Furthermore, the framework maintains a 0.00% utilization of URAM/DSP, yielding 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
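The deadlock-free DATAFLOW protocol mentioned above refers to overlapping data movement with computation in the Programmable Logic. The snippet below is only a generic Vitis HLS dataflow sketch, assuming stream-connected load/compute/store stages with matched production and consumption rates (which is what keeps a dataflow region free of deadlock); the function names, stream depths, and the trivial compute stage are illustrative and do not reproduce Tempus's actual PL kernels or PLIO-reuse scheme.

```cpp
#include <hls_stream.h>

// Load -> compute -> store stages connected by streams. Under the DATAFLOW
// pragma the stages execute concurrently, overlapping transfers with compute.
static void load_stage(const int* in, hls::stream<int>& s, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        s.write(in[i]);
    }
}

static void compute_stage(hls::stream<int>& in, hls::stream<int>& out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() * 2);  // stand-in for the real tile computation
    }
}

static void store_stage(hls::stream<int>& s, int* out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = s.read();
    }
}

// Top-level dataflow region: every stage reads and writes the same number of
// elements, so no stream can stall indefinitely.
void pl_pipeline(const int* in, int* out, int n) {
#pragma HLS DATAFLOW
    hls::stream<int> s0("s0"), s1("s1");
#pragma HLS STREAM variable=s0 depth=64
#pragma HLS STREAM variable=s1 depth=64
    load_stage(in, s0, n);
    compute_stage(s0, s1, n);
    store_stage(s1, out, n);
}
```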