Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
TLDR
Tempus is a novel temporal GEMM framework for AMD Versal AI Edge SoCs, enabling scalable and efficient LLM inference on resource-constrained edge devices.
Key contributions
- A fixed block of 16 AIE-ML cores scales GEMM through iterative graph execution and data tiling in the Programmable Logic (a minimal tiling sketch follows this list).
- Achieves 607 GOPS at 10.677 W with high-speed cascade streaming and deadlock-free dataflow.
- Outperforms the leading spatial SOTA (ARIES) by 211.2x in the Platform-Aware Utility (PAU) prominence factor while offering significant resource frugality.
- Enables sustainable and scalable LLM inference on edge devices by minimizing resource utilization.
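To make the first contribution concrete, here is a minimal host-side sketch of temporal scaling: a fixed-size tile GEMM is invoked iteratively over every tile of a larger problem, so the compute footprint stays constant as the matrices grow. The tile sizes and the `run_tile_gemm` function are hypothetical placeholders (the tile kernel is emulated on the CPU here to keep the sketch self-contained), not Tempus's actual API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tile sizes matched to a fixed compute block (illustrative
// values only; the real tiling is chosen per platform and datatype).
constexpr std::size_t TM = 64, TN = 64, TK = 64;

// Placeholder for one TMxTK * TKxTN tile GEMM on the fixed 16-core AIE-ML
// block; emulated on the CPU here so the sketch compiles on its own.
void run_tile_gemm(const float* A, const float* B, float* C,
                   std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < TM; ++i)
        for (std::size_t k = 0; k < TK; ++k)
            for (std::size_t j = 0; j < TN; ++j)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

// Temporal scaling: the same fixed-size block is invoked iteratively over
// every (row, column, depth) tile, accumulating partial sums in C, so the
// hardware footprint stays constant while M, N, K grow.
// Assumes M, N, K are multiples of TM, TN, TK and C is zero-initialized.
void temporal_gemm(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C,
                   std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; i += TM)
        for (std::size_t j = 0; j < N; j += TN)
            for (std::size_t k = 0; k < K; k += TK)
                run_tile_gemm(&A[i * K + k], &B[k * N + j], &C[i * N + j],
                              K, N, N);
}
```

In Tempus itself, the inner call corresponds to the fixed 16-core AIE-ML graph, and the tiling, replication, and partial-sum accumulation are handled in the Programmable Logic rather than on a host CPU.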
Why it matters
Edge deployment of LLMs is challenging because of strict compute, memory, and power constraints under which existing spatial-scaling solutions often fail. Tempus addresses this with a resource-invariant, temporally scalable GEMM framework, enabling efficient and sustainable LLM inference on edge AI hardware and paving the way for broader adoption.
Original Abstract
Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores -- an approach that fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores, achieving scalability through iterative graph execution and algorithmic data tiling and replication in the Programmable Logic. High-speed cascade streaming ensures low-latency partial sum reduction at Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we prove that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA (ARIES). Furthermore, the framework maintains a 0.00% utilization of URAM/DSP, yielding 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
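The deadlock-free DATAFLOW protocol mentioned above refers to overlapping data movement with computation in the Programmable Logic. The snippet below is only a generic Vitis HLS dataflow sketch, assuming stream-connected load/compute/store stages with matched production and consumption rates (which is what keeps a dataflow region free of deadlock); the function names, stream depths, and the trivial compute stage are illustrative and do not reproduce Tempus's actual PL kernels or PLIO-reuse scheme.

```cpp
#include <hls_stream.h>

// Load -> compute -> store stages connected by streams. Under the DATAFLOW
// pragma the stages execute concurrently, overlapping transfers with compute.
static void load_stage(const int* in, hls::stream<int>& s, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        s.write(in[i]);
    }
}

static void compute_stage(hls::stream<int>& in, hls::stream<int>& out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() * 2);  // stand-in for the real tile computation
    }
}

static void store_stage(hls::stream<int>& s, int* out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = s.read();
    }
}

// Top-level dataflow region: every stage reads and writes the same number of
// elements, so no stream can stall indefinitely.
void pl_pipeline(const int* in, int* out, int n) {
#pragma HLS DATAFLOW
    hls::stream<int> s0("s0"), s1("s1");
#pragma HLS STREAM variable=s0 depth=64
#pragma HLS STREAM variable=s1 depth=64
    load_stage(in, s0, n);
    compute_stage(s0, s1, n);
    store_stage(s1, out, n);
}
```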