ArXiv TLDR

EffiMiniVLM: A Compact Dual-Encoder Regression Framework

arXiv: 2604.03172

Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum

cs.CV

TLDR

EffiMiniVLM is a compact, efficient dual-encoder VLM for product quality prediction that matches much larger models while using significantly fewer parameters, FLOPs, and training data.

Key contributions

  • Proposes EffiMiniVLM, a compact dual-encoder VLM using EfficientNet-B0 and MiniLM for product quality prediction.
  • Introduces a weighted Huber loss that leverages rating counts to emphasize more reliable samples, improving training sample efficiency (a minimal sketch follows this list).
  • Achieves competitive performance while being roughly 4-8x more resource-efficient than the other top-5 methods, using only 20% of the Amazon Reviews 2023 dataset.
  • The only benchmarked approach that does not rely on extensive external datasets.
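The weighted Huber loss is only described at this level of detail here. Below is a minimal PyTorch sketch, assuming rating counts are log-scaled and normalized to unit mean before weighting a standard Huber (smooth L1) term; the exact transform used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def weighted_huber_loss(pred: torch.Tensor,
                        target: torch.Tensor,
                        rating_counts: torch.Tensor,
                        delta: float = 1.0) -> torch.Tensor:
    """Huber loss with per-sample weights derived from rating counts.

    Assumption (not from the paper): weights are log1p-scaled counts
    normalized to mean 1, so heavily rated items, whose quality labels
    are more reliable, contribute more to the training signal.
    """
    weights = torch.log1p(rating_counts.float())
    weights = weights / weights.mean().clamp(min=1e-8)

    # Element-wise Huber loss, then a weighted mean over the batch.
    per_sample = F.huber_loss(pred, target, reduction="none", delta=delta)
    return (weights * per_sample).mean()
```

In a training loop this would replace the plain regression loss, e.g. loss = weighted_huber_loss(model_output, quality_score, counts).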

Why it matters

This paper introduces a highly efficient and compact vision-language model for product quality prediction in cold-start scenarios. It significantly reduces computational costs and data requirements while maintaining competitive performance, making advanced multimodal analysis more accessible.

Original Abstract

Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
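The abstract specifies the two encoders and a lightweight regression head, but not how their outputs are fused. Below is a minimal PyTorch sketch assuming mean-pooled MiniLM features are concatenated with EfficientNet-B0 features before a small MLP head; the checkpoint name sentence-transformers/all-MiniLM-L6-v2, the pooling, and the head sizes are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0
from transformers import AutoModel

class DualEncoderRegressor(nn.Module):
    """Sketch of the dual-encoder design: EfficientNet-B0 (images) + MiniLM (text)
    -> concatenation -> small regression head. Fusion and head are assumptions."""

    def __init__(self, text_ckpt: str = "sentence-transformers/all-MiniLM-L6-v2"):
        super().__init__()
        # Image encoder: EfficientNet-B0 with its classifier removed (1280-d features).
        self.image_encoder = efficientnet_b0(weights="IMAGENET1K_V1")
        self.image_encoder.classifier = nn.Identity()
        # Text encoder: MiniLM (384-d hidden size for all-MiniLM-L6-v2).
        self.text_encoder = AutoModel.from_pretrained(text_ckpt)
        # Lightweight regression head over the concatenated features.
        self.head = nn.Sequential(
            nn.Linear(1280 + 384, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img_feat = self.image_encoder(pixel_values)                    # (B, 1280)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        txt_feat = (txt * mask).sum(1) / mask.sum(1).clamp(min=1e-8)   # mean pooling, (B, 384)
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.head(fused).squeeze(-1)                            # predicted quality score
```

The two backbones alone come to roughly 5.3M (EfficientNet-B0) plus 22.7M (MiniLM-L6) parameters, about 28M in total, which is consistent with the 27.7M reported in the abstract.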
