TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu, Guoqing Li, et al.
TLDR
TCL introduces a continual learning framework for fast and efficient cross-hardware tensor program optimization, cutting average tuning time by 16.8x on CPU and 12.48x on GPU and lowering inference latency by 1.20x and 1.13x, respectively, versus Tenset-MLP.
Key contributions
- RDU Sampler: an active learning strategy that selects only 10% of tensor programs by jointly scoring Representativeness, Diversity, and Uncertainty, cutting data collection costs while retaining near-original model accuracy (see the first sketch after this list).
- Mamba-based cost model: captures long-range schedule dependencies through lightweight sequence modeling and reduced parameterization, striking a favorable trade-off between prediction accuracy and computational cost (see the second sketch below).
- Continuous knowledge distillation: progressively transfers cost-model knowledge across hardware platforms, avoiding the parameter explosion and data dependency issues of traditional multi-task learning (see the third sketch below).
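As a concrete illustration of the first contribution, here is a minimal greedy sketch of RDU-style selection. It assumes programs are already embedded as feature vectors and that per-program uncertainty comes from the cost model; the scoring proxies and the weights `alpha`/`beta`/`gamma` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def rdu_select(features, uncertainty, budget_frac=0.10,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Greedily pick ~budget_frac of the programs by a combined
    Representativeness/Diversity/Uncertainty score (weights assumed)."""
    n = features.shape[0]
    budget = max(1, int(budget_frac * n))
    # Representativeness: closeness to the dataset centroid (a simple proxy).
    rep = -np.linalg.norm(features - features.mean(axis=0), axis=1)
    selected = []
    nearest = np.zeros(n)  # distance to the nearest already-selected program
    for _ in range(budget):
        score = alpha * rep + beta * nearest + gamma * uncertainty
        if selected:
            score[selected] = -np.inf  # never re-pick a program
        i = int(np.argmax(score))
        selected.append(i)
        d = np.linalg.norm(features - features[i], axis=1)
        nearest = d if len(selected) == 1 else np.minimum(nearest, d)
    return selected
```

The greedy max-min distance term is a standard stand-in for diversity; whichever exact criteria TCL uses, the point is the same: only the ~10% of programs that are representative, mutually diverse, and informative get measured on hardware.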
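For the second contribution, a minimal PyTorch sketch of a Mamba-based cost regressor is shown below, assuming the open-source `mamba_ssm` package (CUDA required) and treating each schedule as a sequence of per-step feature vectors. All dimensions (`feat_dim`, `d_model`, layer count) are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class MambaCostModel(nn.Module):
    """Illustrative Mamba-based cost model: embeds a schedule's
    per-step features and regresses a single cost (latency) score."""
    def __init__(self, feat_dim=164, d_model=128, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.blocks = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
             for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                # x: (batch, seq_len, feat_dim)
        h = self.proj(x)
        for blk in self.blocks:
            h = h + blk(h)               # residual selective-SSM blocks
        h = self.norm(h).mean(dim=1)     # pool over schedule steps
        return self.head(h).squeeze(-1)  # predicted cost score
```

Linear-time selective state space blocks are what make long-range dependency modeling cheap here relative to quadratic self-attention, which matches the paper's accuracy/cost trade-off claim.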
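For the third contribution, a hedged sketch of one continual-distillation step when moving to a new hardware platform: the student fits freshly measured costs while a frozen teacher (trained on the previous platform) regularizes it, so no old-platform data needs to be kept. The loss form and the weight `lam` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, feats, new_costs, opt, lam=0.5):
    """One continual-distillation update onto a new platform.
    `teacher` is the frozen cost model from the previous platform;
    `lam` balances new measurements vs. retained knowledge (assumed)."""
    teacher.eval()
    with torch.no_grad():
        soft = teacher(feats)  # previous-platform predictions
    pred = student(feats)
    loss = F.mse_loss(pred, new_costs) + lam * F.mse_loss(pred, soft)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because only the teacher's predictions are replayed, the student avoids both per-platform parameter growth and the need to store earlier platforms' training data, which is the failure mode of multi-task learning that the paper highlights.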
Why it matters
Existing DL compilers depend on large offline datasets, so auto-tuning incurs high data collection costs and transfers poorly across hardware. TCL addresses both problems in one framework, drastically cutting tuning time and improving inference latency across diverse CPU and GPU platforms. This makes DL compiler auto-tuning far more practical and efficient.
Original Abstract
Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.