ArXiv TLDR

Interactive Multi-Turn Retrieval for Health Videos

arXiv:2605.01409

Chengzheng Wu, Ke Qiu, Baoming Zhang, Ruiyu Mao, Xulong Tang + 1 more

cs.IR · cs.CV · cs.MM

TLDR

This paper introduces interactive multi-turn retrieval for health videos, contributing a new benchmark corpus (MHVRC) and a dialogue-aware two-stage retrieval framework (DATR) that outperforms single-turn baselines.

Key contributions

  • Addresses limitations of single-turn health video retrieval for complex, evolving information needs.
  • Introduces MHVRC, a new Multi-Turn Health Video Retrieval Corpus for interactive health video search.
  • Proposes DATR, a Dialogue-Aware Two-Stage Retrieval framework for efficient and accurate re-ranking.
  • Demonstrates consistent gains over baselines and better capture of fine-grained procedural semantics.

Why it matters

Health video retrieval needs to be interactive to handle complex, evolving user queries, especially in clinical training and patient education, where a first query is often vague and only becomes clinically meaningful after follow-up constraints are specified. This work provides both a benchmark (MHVRC) and a scalable two-stage framework (DATR) that improve retrieval precision over strong single-turn baselines.

Original Abstract

The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries better capture fine-grained procedural semantics than single-turn annotations. The work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.
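The two-stage recipe in the abstract — fast dual-encoder coarse retrieval followed by re-ranking of top candidates with multi-turn query fusion and a cross-encoder — can be sketched in miniature. Everything below is an illustrative assumption, not the paper's actual implementation: `embed` is a toy stand-in for a CLIP-style encoder, the decay-weighted fusion scheme and the `cross_encoder_score` stub are invented placeholders, and the video descriptions are made up.

```python
import zlib
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a CLIP-style text/video encoder: a deterministic
    pseudo-random unit vector seeded by the text (assumption, not the real model)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def fuse_turns(turns: list[str], decay: float = 0.7) -> np.ndarray:
    """Multi-turn query fusion (hypothetical scheme): later refinements
    receive exponentially higher weight, then the sum is renormalized."""
    weights = [decay ** (len(turns) - 1 - i) for i in range(len(turns))]
    fused = sum(w * embed(t) for w, t in zip(weights, turns))
    return fused / np.linalg.norm(fused)

def coarse_retrieve(query_vec: np.ndarray, index: dict, k: int = 3):
    """Stage 1: efficient dual-encoder search via dot products over
    precomputed video embeddings."""
    scored = [(vid, float(query_vec @ vec)) for vid, vec in index.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]

def cross_encoder_score(turns: list[str], video_desc: str) -> float:
    """Stage 2 stub: a real cross-encoder would jointly attend over the
    dialogue and video content; here we just rescore with the fused query."""
    return float(fuse_turns(turns) @ embed(video_desc))

def datr_search(turns: list[str], videos: dict, k: int = 3) -> list[str]:
    """Coarse retrieval over all videos, then re-rank only the top-k."""
    index = {vid: embed(desc) for vid, desc in videos.items()}
    candidates = coarse_retrieve(fuse_turns(turns), index, k)
    reranked = sorted(
        ((vid, cross_encoder_score(turns, videos[vid])) for vid, _ in candidates),
        key=lambda x: -x[1],
    )
    return [vid for vid, _ in reranked]

videos = {
    "v1": "shoulder rehabilitation exercise with resistance band",
    "v2": "knee replacement post-op walking practice",
    "v3": "hand hygiene technique for clinicians",
}
turns = ["shoulder exercise", "seated, with a resistance band"]
print(datr_search(turns, videos, k=2))
```

The design point this illustrates is the cost split: the expensive pairwise scorer runs only on the k coarse candidates, so the dialogue-aware re-ranking stays cheap even over a large video corpus.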
