ArXiv TLDR

SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

arXiv: 2604.15628

Keisuke Gomi, Keiji Yanai

cs.CV, cs.CL, cs.IR, cs.LG, cs.MM

TLDR

SIMMER introduces a novel MLLM-based single encoder for cross-modal food image-recipe retrieval, achieving state-of-the-art performance.

Key contributions

  • Replaces dual-encoder systems with a single MLLM-based encoder (VLM2Vec) for image-recipe retrieval.
  • Designs tailored prompt templates to effectively process structured recipe components (title, ingredients, instructions).
  • Introduces component-aware data augmentation, improving robustness to incomplete recipe inputs.
  • Achieves state-of-the-art results on Recipe1M, boosting image-to-recipe R@1 from 81.8% to 87.5% (1k setting).
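The two training-side ideas above, structured prompt templates and component-aware augmentation, can be sketched roughly as follows. This is a hypothetical illustration, not the paper's code: the template wording, field names, and `drop_prob` value are assumptions, and the real system feeds the resulting text (and food images) into a VLM2Vec-style MLLM encoder.

```python
import random

# Illustrative prompt template for a structured recipe (title, ingredients,
# instructions). The exact wording used by SIMMER is not specified here.
RECIPE_TEMPLATE = (
    "Represent this recipe for image retrieval.\n"
    "Title: {title}\n"
    "Ingredients: {ingredients}\n"
    "Instructions: {instructions}"
)

def build_prompt(recipe: dict) -> str:
    """Format a structured recipe dict into a single text prompt
    for the unified MLLM encoder."""
    return RECIPE_TEMPLATE.format(
        title=recipe.get("title", ""),
        ingredients="; ".join(recipe.get("ingredients", [])),
        instructions=" ".join(recipe.get("instructions", [])),
    )

def component_aware_augment(recipe: dict, drop_prob: float = 0.3) -> dict:
    """Randomly drop ingredients/instructions so the model also trains on
    partial recipes, improving robustness to incomplete inputs."""
    out = {"title": recipe["title"]}  # keep the title as an anchor
    for key in ("ingredients", "instructions"):
        if key in recipe and random.random() > drop_prob:
            out[key] = recipe[key]
    return out
```

During training, each recipe would be passed through `component_aware_augment` before `build_prompt`, so the encoder sees both complete and partial recipe texts.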

Why it matters

This paper significantly advances cross-modal food image-recipe retrieval by simplifying the architecture with a unified MLLM encoder. It overcomes limitations of complex dual-encoder systems, making the task more efficient and robust. The improved performance has practical implications for nutritional management and cooking assistance tools.

Original Abstract

Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
