Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

April 10, 2026 · arXiv:2604.08971

Yueyuan Sui, Payal Mohapatra, Doğaç Eldenk, Haodong Yang, Yiting Zhang + 3 more

cs.LG

TLDR

SentryFuse enables efficient multimodal AI on edge devices by using modality-aware zero-shot pruning and sparse attention, reducing memory and latency.

Key contributions

SentryGate learns modality-conditioned importance for zero-shot pruning of attention heads and FFN channels.
SentryAttend replaces dense self-attention with sparse grouped-query attention, for a net 15% reduction in GFLOPs across three multimodal architectures.
SentryGate improves average accuracy by 12.7% over the strongest pruning baseline, and by up to 18% under modality dropout.
SentryFuse reduces memory by 28.2% and lowers latency by up to 1.63x without further fine-tuning.
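
To make the idea of modality-conditioned pruning concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the score table, the averaging rule over present modalities, the modality names, and the function names are all assumptions chosen for illustration. It only shows how the set of retained attention heads can change with the sensors that are available, with no fine-tuning step.

```python
from typing import Dict, List, Sequence

# Hypothetical per-head importance scores, one row per modality.
# In the paper these are learned during training; here they are made up.
SCORES: Dict[str, List[float]] = {
    "audio": [0.9, 0.1, 0.4, 0.2],
    "imu":   [0.1, 0.8, 0.3, 0.6],
}


def modality_conditioned_scores(
    per_modality_scores: Dict[str, Sequence[float]],
    present: Sequence[str],
) -> List[float]:
    """Average per-head scores over the modalities that are present.

    Averaging is an assumed combination rule, not the paper's.
    """
    if not present:
        raise ValueError("at least one modality must be present")
    rows = [per_modality_scores[m] for m in present]
    n_heads = len(rows[0])
    return [sum(row[h] for row in rows) / len(rows) for h in range(n_heads)]


def select_heads(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the sorted indices of the highest-scoring heads to keep."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


# Keep half of the heads under three sensor-availability scenarios.
keep_both = select_heads(modality_conditioned_scores(SCORES, ["audio", "imu"]), 0.5)
keep_audio = select_heads(modality_conditioned_scores(SCORES, ["audio"]), 0.5)
keep_imu = select_heads(modality_conditioned_scores(SCORES, ["imu"]), 0.5)
```

With this toy table, the heads kept when only one sensor is present differ from those kept when both are present, which is the behavior a static, modality-blind importance score cannot express.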

Why it matters

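SentryFuse targets multimodal inference on edge devices whose power budgets fluctuate and whose sensors can drop out unpredictably. Because its importance scores are conditioned on which modalities are present, it can prune at deployment without fine-tuning, a step the abstract says costs existing pruning methods over 10x the deployment energy. The reported result is higher accuracy than the strongest pruning baseline, together with lower memory use and latency.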

Original Abstract

Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and up to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.
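
The abstract does not specify SentryAttend's sparsity pattern, so the following is only a rough sketch of what sparse grouped-query attention can look like. It assumes a local sliding window as the sparsity pattern (a stand-in, not the paper's choice) and uses hypothetical names such as `sparse_gqa` and `window`. The grouped-query part is the standard idea: several query heads share one key/value group, which shrinks the key/value tensors.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape: [sequence_length][head_dim]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_gqa(
    q_heads: List[Matrix],
    k_groups: List[Matrix],
    v_groups: List[Matrix],
    window: int,
) -> List[Matrix]:
    """Grouped-query attention restricted to a local window (assumed pattern).

    Query head h reads the key/value group h // (n_q_heads / n_kv_groups).
    Token i attends only to tokens j with |i - j| <= window.
    """
    n_q, n_kv = len(q_heads), len(k_groups)
    if n_kv == 0 or n_q % n_kv != 0:
        raise ValueError("query heads must divide evenly into key/value groups")
    heads_per_group = n_q // n_kv

    outputs: List[Matrix] = []
    for h, q in enumerate(q_heads):
        k = k_groups[h // heads_per_group]
        v = v_groups[h // heads_per_group]
        scale = 1.0 / math.sqrt(len(q[0]))
        head_out: Matrix = []
        for i, q_i in enumerate(q):
            lo, hi = max(0, i - window), min(len(k), i + window + 1)
            logits = [
                scale * sum(a * b for a, b in zip(q_i, k[j]))
                for j in range(lo, hi)
            ]
            weights = softmax(logits)
            head_out.append([
                sum(w * v[lo + t][d] for t, w in enumerate(weights))
                for d in range(len(v[0]))
            ])
        outputs.append(head_out)
    return outputs


# Toy input: 4 query heads sharing 2 key/value groups, 3 tokens, head_dim 2.
q_heads = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] for _ in range(4)]
k_groups = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] for _ in range(2)]
v_groups = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
out_local = sparse_gqa(q_heads, k_groups, v_groups, window=0)
```

In this sketch the savings come from two places: each token scores at most `2 * window + 1` keys instead of the full sequence, and keys and values are stored once per group rather than once per query head. With `window=0` every token attends only to itself, so each head simply returns its group's values, which makes the head-to-group mapping easy to check.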
