Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Shiyi Zhang, Yiji Cheng, Tiankai Hang, Zijin Yin, Runze He, and 6 more authors
TLDR
Meta-CoT enhances image-editing granularity and generalization by decomposing each editing operation into a (task, target, required understanding ability) triplet and reducing editing tasks to five fundamental meta-tasks.
Key contributions
- Decomposes each editing operation into a (task, target, required understanding ability) triplet for fine-grained understanding.
- Breaks down editing tasks into five fundamental meta-tasks to achieve strong generalization to unseen edits.
- Introduces a CoT-Editing Consistency Reward to align model behavior with Chain-of-Thought reasoning.
- Achieves a 15.8% overall improvement across 21 editing tasks and generalizes effectively to unseen edits.
Why it matters
Meta-CoT advances image editing by improving both understanding granularity and generalization. Its two-level decomposition and CoT-Editing Consistency Reward yield a 15.8% overall improvement across 21 editing tasks, and the model generalizes to unseen editing tasks after training on only a small set of meta-tasks.
Original Abstract
Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
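To make the first-level decomposition concrete, here is a minimal sketch of how an editing instruction might be expanded into one (task, target, required understanding ability) triplet per target, so that a task-specific CoT can be generated for each. This is an illustrative assumption, not the authors' implementation: the class and function names (`EditTriplet`, `decompose`), the example ability label, and the meta-task label are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical representation of Meta-CoT's editing triplet.
# Field names are illustrative; the paper defines the triplet as
# (task, target, required understanding ability).
@dataclass(frozen=True)
class EditTriplet:
    task: str     # meta-task label, e.g. a removal-type edit
    target: str   # the specific object/region the edit applies to
    ability: str  # the understanding ability the edit requires

def decompose(task: str, targets: list[str], ability: str) -> list[EditTriplet]:
    """First-level decomposition (sketch): traverse the editing
    operation over all targets, yielding one triplet per target,
    each of which would drive its own task-specific CoT."""
    return [EditTriplet(task=task, target=t, ability=ability) for t in targets]

# Example: "remove all cups" touches two detected targets.
triplets = decompose("remove", ["cup_1", "cup_2"], ability="object grounding")
for tr in triplets:
    print(tr.task, tr.target, tr.ability)
```

The second decomposition level would then restrict `task` to one of the five fundamental meta-tasks, which (per the abstract) suffice for generalization to unseen composite edits.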