Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
Jun Li, Peifeng Lai, Xuhang Lou, Jinpeng Wang, Yuting Wang, et al.
TLDR
Holmes introduces a hierarchical evidential learning framework that explicitly models and quantifies uncertainty in partially relevant video retrieval, outperforming state-of-the-art methods.
Key contributions
- Proposes Holmes, a hierarchical evidential learning framework for partially relevant video retrieval.
- Models inter-video similarity as evidential support with Dirichlet distributions to quantify uncertainty.
- Uses fine-grained query identification and query-adaptive calibrated learning via a three-fold principle.
- Employs soft query-clip alignment via flexible optimal transport to alleviate sparse temporal supervision.
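The second contribution, treating similarity scores as evidential support under a Dirichlet distribution, follows the general recipe of evidential deep learning. The sketch below illustrates that idea under assumptions of my own (the ReLU evidence mapping, the `alpha = evidence + 1` parameterization, and the function name `dirichlet_uncertainty` are illustrative, not the paper's exact formulation):

```python
import numpy as np

def dirichlet_uncertainty(similarities):
    """Map a query's similarity scores over K candidate videos to
    Dirichlet parameters, per-video belief masses, and a scalar
    uncertainty mass (illustrative sketch, not the paper's code)."""
    evidence = np.maximum(similarities, 0.0)  # non-negative evidence
    alpha = evidence + 1.0                    # Dirichlet parameters
    strength = alpha.sum()                    # total evidence strength
    belief = evidence / strength              # belief mass per video
    uncertainty = len(alpha) / strength       # K / sum(alpha)
    return belief, uncertainty

# A confident query concentrates evidence on one video; a vague query
# spreads it thinly, which yields a larger uncertainty mass.
b_sharp, u_sharp = dirichlet_uncertainty(np.array([9.0, 0.1, 0.2, 0.0]))
b_vague, u_vague = dirichlet_uncertainty(np.array([2.0, 2.1, 1.9, 2.0]))
```

By construction the belief masses and the uncertainty mass sum to one, so the uncertainty can be read directly as "evidence the model does not have" for a given query.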
Why it matters
This paper tackles the inherent uncertainty in partially relevant video retrieval, where brief queries often lead to ambiguity. Holmes explicitly models this uncertainty using a hierarchical evidential learning framework. This significantly improves retrieval accuracy and robustness, making video search more effective for vague or incomplete queries.
Original Abstract
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
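The intra-video component described in the abstract, soft query-clip alignment via optimal transport with a dustbin, can be sketched with entropic Sinkhorn iterations plus one extra column that absorbs unmatched mass. Everything below (uniform marginals, the fixed `dustbin_cost`, `eps`, and the function name) is my assumption for illustration, not the paper's flexible-OT formulation:

```python
import numpy as np

def sinkhorn_with_dustbin(cost, dustbin_cost=1.0, eps=0.1, n_iters=200):
    """cost: (Q, C) query-to-clip cost matrix. Returns a (Q, C+1)
    transport plan whose last column is the dustbin, letting clips
    irrelevant to the query dump their mass there (illustrative sketch)."""
    Q, C = cost.shape
    # append a constant-cost dustbin column to the cost matrix
    full_cost = np.concatenate([cost, np.full((Q, 1), dustbin_cost)], axis=1)
    K = np.exp(-full_cost / eps)              # Gibbs kernel
    r = np.full(Q, 1.0 / Q)                   # uniform query marginal
    c = np.full(C + 1, 1.0 / (C + 1))         # clips + dustbin marginal
    u = np.ones(Q)
    v = np.ones(C + 1)
    for _ in range(n_iters):                  # Sinkhorn scaling updates
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport plan

rng = np.random.default_rng(0)
cost = rng.random((2, 5))                     # 2 queries, 5 clips
plan = sinkhorn_with_dustbin(cost)            # shape (2, 6)
```

The dustbin column plays the same role as in matching methods such as SuperGlue: rather than forcing every query to align with some clip, low-affinity mass is routed to the dustbin, which is one way to suppress spurious local responses under sparse supervision.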