SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
TLDR
SIMMER introduces a novel MLLM-based single encoder for cross-modal food image-recipe retrieval, achieving state-of-the-art performance.
Key contributions
- Replaces dual-encoder systems with a single MLLM-based encoder (VLM2Vec) for image-recipe retrieval.
- Designs tailored prompt templates to effectively process structured recipe components (title, ingredients, instructions).
- Introduces component-aware data augmentation, improving robustness to incomplete recipe inputs.
- Achieves state-of-the-art results on Recipe1M, boosting image-to-recipe R@1 from 81.8% to 87.5% (1k setting).
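The prompt-template and component-aware augmentation ideas above can be sketched roughly as follows. This is a minimal illustration only: the template wording, function names, and drop probability are assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical prompt template: serialize a structured recipe (title,
# ingredients, instructions) into one text prompt for an MLLM-based
# embedding model such as VLM2Vec. Exact wording is an assumption.
RECIPE_TEMPLATE = (
    "Represent this recipe for retrieval.\n"
    "Title: {title}\n"
    "Ingredients: {ingredients}\n"
    "Instructions: {instructions}"
)

def build_recipe_prompt(recipe: dict) -> str:
    """Flatten a recipe dict into a single prompt string."""
    return RECIPE_TEMPLATE.format(
        title=recipe.get("title", ""),
        ingredients="; ".join(recipe.get("ingredients", [])),
        instructions=" ".join(recipe.get("instructions", [])),
    )

def component_dropout(recipe: dict, p_drop: float = 0.3, rng=None) -> dict:
    """Component-aware augmentation (sketch): independently drop the
    ingredients or instructions (never the title) with probability
    p_drop, so training also sees partial recipes."""
    rng = rng or random.Random()
    out = dict(recipe)
    for key in ("ingredients", "instructions"):
        if rng.random() < p_drop:
            out[key] = []
    return out
```

During training, each recipe would be passed through `component_dropout` before `build_recipe_prompt`, yielding a mix of complete and partial prompts that improves robustness to incomplete inputs.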
Why it matters
This paper advances cross-modal food image-recipe retrieval by replacing complex dual-encoder pipelines with a single unified MLLM encoder, avoiding the alignment strategies and task-specific network designs those pipelines require. The simpler architecture and improved accuracy have practical implications for nutritional management, dietary logging, and cooking assistance tools.
Original Abstract
Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.