ArXiv TLDR

CMedTEB & CARE: Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

arXiv: 2604.10937

Angqing Jiang, Jianlyu Chen, Zhe Fang, Yongcan Wang, Xinpeng Li + 2 more

cs.IR

TLDR

This paper introduces CMedTEB, a high-fidelity Chinese medical text embedding benchmark, and CARE, an asymmetric retriever that surpasses state-of-the-art symmetric models without increasing inference latency.

Key contributions

  • Introduces CMedTEB, a high-fidelity Chinese medical text embedding benchmark for retrieval, reranking, and STS.
  • CMedTEB is curated using a multi-LLM voting pipeline validated by clinical experts to ensure gold-standard quality.
  • Proposes CARE, an asymmetric retriever pairing a lightweight BERT query encoder with a powerful LLM document encoder (sketched after this list).
  • Develops a two-stage training strategy that progressively bridges the query and document representations produced by CARE's structurally different encoders.

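The asymmetric design is the central idea: documents are embedded once, offline, by the heavy encoder, while only the lightweight encoder runs at query time. The sketch below illustrates that split. The checkpoints ("bert-base-chinese" for the query side, a generic multilingual embedder standing in for the LLM document encoder) and the projection layer are assumptions for illustration only; the digest does not name CARE's actual components.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Stand-in checkpoints: the digest does not name CARE's released weights.
QUERY_MODEL = "bert-base-chinese"              # lightweight online query encoder
DOC_MODEL = "intfloat/multilingual-e5-large"   # placeholder for the LLM-based document encoder

def mean_pool(hidden, mask):
    """Average token embeddings over non-padding positions."""
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

@torch.no_grad()
def embed(model, tokenizer, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    return F.normalize(mean_pool(hidden, batch["attention_mask"]), dim=-1)

query_tok = AutoTokenizer.from_pretrained(QUERY_MODEL)
query_enc = AutoModel.from_pretrained(QUERY_MODEL).eval()
doc_tok = AutoTokenizer.from_pretrained(DOC_MODEL)
doc_enc = AutoModel.from_pretrained(DOC_MODEL).eval()

# In a trained asymmetric retriever a learned head maps the query space into the
# document space; this untrained linear layer only keeps the shapes compatible.
proj = torch.nn.Linear(query_enc.config.hidden_size, doc_enc.config.hidden_size)

# Offline: embed the corpus once with the heavy document encoder.
corpus = ["糖尿病患者的饮食管理建议。", "高血压常用的一线降压药物。"]
doc_vecs = embed(doc_enc, doc_tok, corpus)

# Online: only the small encoder (plus the projection) runs per query.
query_vecs = embed(query_enc, query_tok, ["2型糖尿病如何通过饮食控制血糖？"])
with torch.no_grad():
    query_vecs = F.normalize(proj(query_vecs), dim=-1)

scores = query_vecs @ doc_vecs.T   # cosine similarity, since all vectors are unit-normalized
print(scores.argsort(dim=-1, descending=True))
```

Because the corpus vectors are precomputed, the online cost is a single small-encoder forward pass plus a matrix multiply, which is where the latency savings come from.
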
Why it matters

The paper addresses the need for medical text retrieval that is both accurate and low-latency, especially in Chinese. It provides a much-needed, high-quality benchmark (CMedTEB) and a model (CARE) that sidesteps the prohibitive latency and computational cost of LLM-based embedders in real-time applications, advancing efficient and reliable medical information access.

Original Abstract

Effective medical text retrieval requires both high accuracy and low latency. While LLM-based embedding models possess powerful retrieval capabilities, their prohibitive latency and high computational cost limit their application in real-time scenarios. Furthermore, the lack of comprehensive and high-fidelity benchmarks hinders progress in Chinese medical text retrieval. In this work, we introduce the Chinese Medical Text Embedding Benchmark (CMedTEB), a benchmark spanning three kinds of practical embedding tasks: retrieval, reranking, and semantic textual similarity (STS). Distinct from purely automated datasets, CMedTEB is curated via a rigorous multi-LLM voting pipeline validated by clinical experts, ensuring gold-standard label quality while effectively mitigating annotation noise. On this foundation, we propose the Chinese Medical Asymmetric REtriever (CARE), an asymmetric architecture that pairs a lightweight BERT-style encoder for online query encoding with a powerful LLM-based encoder for offline document encoding. However, optimizing such an asymmetric retriever with two structurally different encoders presents distinctive challenges. To address this, we introduce a novel two-stage training strategy that progressively bridges the query and document representations. Extensive experiments demonstrate that CARE surpasses state-of-the-art symmetric models on CMedTEB, achieving superior retrieval performance without increasing inference latency.
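
The abstract does not spell out the two training stages beyond "progressively bridging" the query and document representations. A common recipe for bridging structurally different encoders, assumed here purely for illustration, is a first stage that distills the frozen document encoder's embedding space into the query encoder, followed by a second stage of contrastive fine-tuning with in-batch negatives. The loss functions for that assumed recipe would look roughly like this:

```python
import torch
import torch.nn.functional as F

def alignment_loss(q_emb, d_emb):
    """Stage 1 (assumed): pull the query encoder's output toward the frozen
    document encoder's embedding of the paired text (cosine distillation)."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb.detach(), dim=-1)      # document encoder stays frozen
    return (1.0 - (q * d).sum(dim=-1)).mean()

def infonce_loss(q_emb, d_emb, temperature=0.05):
    """Stage 2 (assumed): contrastive fine-tuning with in-batch negatives;
    each query's positive document sits on the diagonal of the score matrix."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = (q @ d.T) / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

# Toy sanity check with random "embeddings" of a shared dimension.
q = torch.randn(8, 1024, requires_grad=True)
d = torch.randn(8, 1024)
print(alignment_loss(q, d).item(), infonce_loss(q, d).item())
```

Whether CARE uses these exact objectives, which components are frozen in each stage, and how the stages are scheduled are details specified only in the paper itself.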
