MAny: Merge Anything for Multimodal Continual Instruction Tuning
Zijian Gao, Wangwang Jia, Xingxing Zhang, Pengfei Qian, Tao Sun + 4 more
TLDR
MAny introduces a training-free framework to combat dual-forgetting in multimodal continual instruction tuning by merging task-specific knowledge.
Key contributions
- Addresses "dual-forgetting" in MLLMs across perception and reasoning spaces during continual learning.
- Introduces Cross-modal Projection Merging (CPM), which adaptively merges cross-modal visual representations via visual-prototype guidance to recover perceptual alignment.
- Presents Low-rank Parameter Merging (LPM), which recursively merges task-specific low-rank weight matrices with a closed-form recursive least-squares fusion, ensuring reasoning stability.
- Operates as a training-free paradigm using efficient CPU-based algebraic operations for knowledge merging.
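To make the CPM idea concrete, here is a minimal sketch of prototype-guided adaptive merging. The function name, the cosine-similarity scoring, and the softmax weighting are assumptions for illustration; the paper's actual visual-prototype guidance mechanism may differ.

```python
import numpy as np

def cpm_merge(x, prototypes, projections):
    """Hypothetical sketch: merge task-specific projection matrices,
    weighting each task by the cosine similarity between the input
    feature x and that task's visual prototype."""
    sims = np.array([
        (x @ p) / (np.linalg.norm(x) * np.linalg.norm(p))
        for p in prototypes
    ])
    # Softmax over similarities gives adaptive, input-dependent merge weights.
    w = np.exp(sims - sims.max())
    w /= w.sum()
    # Weighted sum of the per-task projection matrices, then project x.
    merged = sum(wi * Wi for wi, Wi in zip(w, projections))
    return merged @ x
```

Because the merge is a weighted sum of fixed matrices, it needs no gradients and runs cheaply on CPU, consistent with the training-free claim above.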
Why it matters
This paper tackles catastrophic forgetting in MLLMs, a major obstacle to sequential task adaptation. MAny's training-free approach offers an efficient solution, improving final average accuracy on the UCIT benchmark by up to 8.57% over state-of-the-art methods. By merging knowledge in both the perception and reasoning spaces, it provides a practical path for continual learning without additional gradient-based training.
Original Abstract
Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present \textbf{MAny} (\textbf{M}erge \textbf{Any}thing), a framework that merges task-specific knowledge through \textbf{C}ross-modal \textbf{P}rojection \textbf{M}erging (\textbf{CPM}) and \textbf{L}ow-rank \textbf{P}arameter \textbf{M}erging (\textbf{LPM}). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57\% and 2.85\% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
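The LPM recursion can be illustrated with a simplified stand-in objective. Assuming the goal is the matrix closest (in Frobenius norm) to all task-specific low-rank updates ΔW_t = B_t A_t, the running-mean recursion below is the closed-form recursive least-squares solution; the paper's actual objective and weighting may be more elaborate.

```python
import numpy as np

def lpm_recursive_merge(lora_updates):
    """Sketch: recursively fuse low-rank updates (B_t, A_t).
    The running mean is the closed-form solution to
    min_W sum_t ||W - B_t A_t||_F^2, updated one task at a time
    without revisiting earlier tasks."""
    merged = None
    for t, (B, A) in enumerate(lora_updates, start=1):
        delta = B @ A  # reconstruct the full-rank task update
        if merged is None:
            merged = delta
        else:
            # Recursive update: fold task t in with weight 1/t.
            merged = merged + (delta - merged) / t
    return merged
```

Each step is a single matrix addition and scaling, so merging T tasks costs O(T d²) CPU-side arithmetic and requires no retraining.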