Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration
Md Zesun Ahmed Mia, Jiahui Duan, Kai Ni, Abhronil Sengupta
TLDR
TrilinearCIM is a novel DG-FeFET-based architecture that accelerates Transformer self-attention in-memory without costly NVM reprogramming, improving energy efficiency and reducing latency.
Key contributions
- A DG-FeFET-based TrilinearCIM architecture for energy-efficient in-memory Transformer attention.
- Enables reprogramming-free attention computation by using back-gate modulation to realize a three-operand MAC primitive (a minimal behavioral sketch follows this list).
- Achieves up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM.
- Outperforms conventional CIM on 7/9 GLUE tasks (BERT-base) with 37.3% area overhead.
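The core idea behind the primitive is that each DG-FeFET cell produces an output proportional to the product of three operands: a weight stored in its ferroelectric state and two runtime values applied electrically. Below is a minimal behavioral sketch in NumPy; the mapping of the dynamic operands to the front gate (per row) and back gate (per column), and the ideal linear cell, are illustrative assumptions rather than the paper's device model.

```python
import numpy as np

def trilinear_mac_array(W_stored, x_front, b_back):
    """Behavioral sketch of a DG-FeFET crossbar performing a three-operand MAC.

    W_stored : (rows, cols) weights held in the ferroelectric state (static NVM)
    x_front  : (rows,) dynamic operand applied on the front gates (one per row)
    b_back   : (cols,) dynamic operand applied via back-gate modulation (one per column)

    Assumed ideal cell: cell (r, c) contributes W_stored[r, c] * x_front[r] * b_back[c]
    to its column output, so both runtime operands enter as inputs and the stored
    weights are never rewritten.
    """
    return (x_front @ W_stored) * b_back  # per-column accumulated output, shape (cols,)
```

A conventional two-operand crossbar would have to program one of these runtime operands into the array before forming the product in-memory; the third operand is what removes that write.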
Why it matters
Transformer self-attention is a bottleneck for Compute-in-Memory (CIM) accelerators because its dynamically generated operands force runtime NVM reprogramming. TrilinearCIM solves this by performing the complete attention computation exclusively in NVM without runtime reprogramming, significantly boosting energy efficiency and throughput for AI accelerators.
Original Abstract
Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce but retain NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM at 37.3% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.
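To see why a three-operand primitive can cover the whole attention block, note that both attention stages can be written as trilinear sums in which exactly one operand is static: the score stage uses the precomputed product W_Q W_K^T, and the output stage uses W_V. The NumPy sketch below checks this algebra against the standard formulation; the precomputation of W_Q W_K^T, the operand-to-array mapping, and performing softmax in peripheral digital logic are assumptions made for illustration, not necessarily the paper's actual dataflow.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_trilinear(X, W_Q, W_K, W_V):
    """Single-head attention written so every matrix product is a trilinear
    sum with exactly one *static* operand (storable once in NVM).

    Scores:  S[i, j] = sum_{a,b} X[i, a] * M[a, b] * X[j, b],  with M = W_Q @ W_K.T
    Output:  O[i, d] = sum_{j, a} A[i, j] * X[j, a] * W_V[a, d]

    M and W_V never change at runtime, so no ferroelectric reprogramming is
    needed; the dynamic operands (X and A) enter as the two runtime inputs of
    the three-operand MAC.
    """
    d_k = W_Q.shape[1]
    M = W_Q @ W_K.T                                   # static, programmed once
    S = np.einsum('ia,ab,jb->ij', X, M, X) / np.sqrt(d_k)
    A = softmax(S, axis=-1)                           # assumed done in peripheral digital logic
    O = np.einsum('ij,ja,ad->id', A, X, W_V)
    return O

# Sanity check against the standard Q/K/V formulation
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
ref = softmax(Q @ K.T / np.sqrt(8), axis=-1) @ V
assert np.allclose(attention_trilinear(X, W_Q, W_K, W_V), ref)
```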