Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge + 4 more
TLDR
This paper introduces a framework for online video understanding that provides transparent decisions and precisely times responses to visual evidence.
Key contributions
- Proposes a novel framework decoupling reasoning control from memory integration for online video analysis.
- Active Thinking Decision Maker (ATDM) enables transparent decisions with observable progress and confidence metrics.
- ATDM precisely times responses to align with the first-sufficient-evidence timestamp in video streams.
- Hierarchical Progressive Semantic Integration (HPSI) provides an efficient, token-budgeted global memory system.
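The ATDM decision loop above can be sketched as a simple streaming controller: at each clip it exposes a progress value and a confidence value, and it responds at the first timestamp where both clear a threshold. This is a minimal illustration only; the per-clip estimators and the thresholds `tau_rho` and `tau_c` are hypothetical stand-ins, not the paper's learned components.

```python
# Sketch of an ATDM-style transparent decision loop (assumption: rho and c
# are already produced per clip by some upstream model; thresholds are
# illustrative constants, not values from the paper).
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ClipState:
    t: float     # timestamp of the clip's last frame (seconds)
    rho: float   # observable progress toward answering the query, in [0, 1]
    c: float     # confidence that current evidence suffices, in [0, 1]


def decide_response_time(
    stream: Iterable[ClipState],
    tau_rho: float = 0.8,
    tau_c: float = 0.9,
) -> Optional[float]:
    """Return the first timestamp t_r at which both metrics clear their
    thresholds, approximating the first-sufficient-evidence time t*."""
    for state in stream:
        # The decision process is externalized: rho and c are visible at
        # every step, so progress can be streamed to the user as reasoning.
        if state.rho >= tau_rho and state.c >= tau_c:
            return state.t  # respond now: evidence is deemed sufficient
    return None  # the query never became answerable within the stream


clips = [ClipState(1.0, 0.3, 0.2), ClipState(2.0, 0.7, 0.6),
         ClipState(3.0, 0.9, 0.95), ClipState(4.0, 1.0, 0.99)]
print(decide_response_time(clips))  # → 3.0
```

Because the controller returns the first qualifying timestamp rather than waiting for the stream to end, the response time t_r tracks the earliest moment evidence becomes sufficient instead of defaulting to the end of the video.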
Why it matters
Conventional video LLMs struggle with real-time, transparent, and evidence-aligned responses in streaming environments. This paper addresses these critical limitations, enabling visual agents to operate effectively in dynamic, online settings with improved accuracy and decision clarity.
Original Abstract
Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce **Thinking-QwenVL**, an instantiation of this framework with two core components. First, the *Active Thinking Decision Maker (ATDM)* is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the *Hierarchical Progressive Semantic Integration (HPSI)* module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63% to 71.60% on the StreamingBench benchmark.
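The HPSI idea described in the abstract, a fixed budget of multi-level aggregation tokens carried across clips, can be sketched numerically. The single-head attention update and the level budgets below are assumed simplifications for illustration; the paper learns these tokens and their integration end to end.

```python
# Minimal numpy sketch of HPSI-style memory: a fixed budget of multi-level
# aggregation tokens is propagated across clips, so the global state grows
# in content but never in token count. The attention update is an assumed
# stand-in for the learned integration module.
import numpy as np

rng = np.random.default_rng(0)
D = 64                  # feature dimension
budget = [4, 2, 1]      # tokens per level (assumed): fine -> coarse


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def integrate_clip(tokens, clip_feats, alpha=0.5):
    """Update each level's tokens by attending over the incoming clip's
    frame features, then blend with the propagated state (causal carry)."""
    new_levels = []
    for level in tokens:
        attn = softmax(level @ clip_feats.T / np.sqrt(D))  # (n_tok, n_frames)
        summary = attn @ clip_feats                        # (n_tok, D)
        new_levels.append(alpha * level + (1 - alpha) * summary)
    return new_levels


# Initialize aggregation tokens (random placeholders for learned params)
# and stream ten clips of 16 frames each through the memory.
tokens = [rng.standard_normal((n, D)) for n in budget]
for _ in range(10):
    clip = rng.standard_normal((16, D))
    tokens = integrate_clip(tokens, clip)

print([lvl.shape for lvl in tokens])  # → [(4, 64), (2, 64), (1, 64)]
```

The key property this illustrates is the token budget: no matter how many clips arrive, the memory footprint stays at `sum(budget)` tokens per query, which is what makes a global, causally ordered state affordable in a streaming setting.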