ArXiv TLDR

CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

arXiv:2604.11615

Jinpeng Ye, Chongxi Wang, Wenqing Li, Bin Yuan, Shiyi Wang + 9 more

cs.AR, cs.AI, cs.DC, cs.LG

TLDR

CUTEv2 is a unified, configurable CPU matrix extension for diverse architectures, offering low-overhead integration and high performance.

Key contributions

  • Proposes CUTEv2, a unified, configurable CPU matrix extension architecture with decoupled units.
  • Decouples matrix units from the CPU pipeline for low-overhead integration across diverse CPUs.
  • Features an asynchronous matrix multiplication abstraction for simplified overlap and unified software.
  • Achieves up to 2.31x speedup on AI models (ResNet, BERT, Llama3) with low area overhead.
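The asynchronous matrix multiplication abstraction is the key to the overlapped matrix-vector execution credited with over 30% of the gains. A minimal sketch of that issue/wait pattern, using a background thread to stand in for the decoupled matrix unit (the function names and tiling here are illustrative, not the paper's actual ISA or API):

```python
# Minimal sketch (not the paper's ISA) of the async matmul pattern:
# issue a tile GEMM to a decoupled "matrix unit", overlap vector-side
# work on an earlier tile, then synchronize before consuming the result.
from concurrent.futures import ThreadPoolExecutor

def gemm_tile(a, b):
    # Naive tile GEMM standing in for the matrix unit's computation.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def vector_epilogue(tile, bias):
    # Vector-side work (bias add + ReLU) that can overlap the next GEMM.
    return [[max(x + bias, 0.0) for x in row] for row in tile]

def fused_layers(tiles_a, b, bias):
    """Pipeline: while the matrix unit computes tile i+1,
    the CPU's vector path post-processes tile i."""
    out = []
    with ThreadPoolExecutor(max_workers=1) as matrix_unit:
        pending = matrix_unit.submit(gemm_tile, tiles_a[0], b)  # "issue"
        for nxt in tiles_a[1:]:
            done = pending.result()                          # "wait"
            pending = matrix_unit.submit(gemm_tile, nxt, b)  # issue next
            out.append(vector_epilogue(done, bias))          # overlapped
        out.append(vector_epilogue(pending.result(), bias))  # drain
    return out
```

With a synchronous, fine-grained extension the epilogue could not start until each GEMM retired; the coarse-grained issue/wait split is what lets the two execution paths run concurrently.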

Why it matters

Existing matrix extensions suffer from high overhead and integration complexity. CUTEv2 offers a practical, adaptable solution, significantly boosting AI workload performance across diverse CPU architectures with minimal overhead, benefiting the open-source community.

Original Abstract

Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU pipeline complicates integration across diverse CPUs, while fine-grained synchronous instructions hinder the development of high-performance kernels. This paper proposes a unified and configurable CPU matrix extension architecture. By decoupling matrix units from the CPU pipeline, the design enables low-overhead integration while maintaining close coordination with existing compute and memory resources. The configurable matrix unit supports mixed-precision operations and adapts to diverse compute demands and memory bandwidth constraints. An asynchronous matrix multiplication abstraction with flexible granularity conceals hardware details, simplifies matrix-vector overlap, and supports a unified software stack. The architecture is integrated into four open-source CPU RTL platforms and evaluated on representative AI models. Matrix unit utilization under GEMM workloads exceeds 90% across all platforms. When configured with compute throughput and memory bandwidth comparable to Intel AMX, our design achieves speedups of 1.57x, 1.57x, and 2.31x on ResNet, BERT, and Llama3, with over 30% of the gains attributed to overlapped matrix-vector execution. A 4 TOPS@2GHz matrix unit occupies only 0.53 mm² in 14nm CMOS. These results demonstrate strong cross-platform adaptability and effective hardware-software co-optimization, offering a practical matrix extension for the open-source community.
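The abstract's claim that the matrix unit "adapts to diverse compute demands and memory bandwidth constraints" boils down to roofline-style balancing. A hypothetical back-of-envelope model (the config fields and thresholds are illustrative, not taken from the paper) of when a given configuration is compute-bound versus bandwidth-bound on a GEMM tile:

```python
# Hypothetical roofline check (not from the paper): for a tile
# C[M,N] += A[M,K] @ B[K,N], compare cycles spent computing against
# cycles spent moving operands at the configured bandwidth.
from dataclasses import dataclass

@dataclass
class MatrixUnitConfig:
    macs_per_cycle: int     # peak multiply-accumulates per cycle
    bytes_per_cycle: float  # sustained memory bandwidth per cycle
    elem_bytes: int         # operand size (e.g. 1 for int8, 2 for bf16)

def is_compute_bound(cfg, m, n, k):
    macs = m * n * k
    # Traffic for streaming A and B once and writing C once.
    traffic = (m * k + k * n + m * n) * cfg.elem_bytes
    cycles_compute = macs / cfg.macs_per_cycle
    cycles_memory = traffic / cfg.bytes_per_cycle
    return cycles_compute >= cycles_memory
```

Under this model, large tiles amortize operand traffic and keep the unit compute-bound, while small tiles starve it; a configurable design can size its arrays and precision to sit at that balance point for the target memory system.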
