OVO-Bench:

How Far is Your Video-LLMs from Real-World Online Video Understanding?

1Shanghai AI Laboratory, 2Tsinghua University
3Beihang University, 4Communication University of China
5The Chinese University of Hong Kong, 6SenseTime Group
*indicates Equal Contribution; † indicates interns at IXCLab, Shanghai AI Laboratory

Abstract

Temporal awareness, the ability to reason dynamically based on the timestamp at which a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for benchmarking advanced online video understanding capability. OVO-Bench evaluates the ability of video LLMs to reason about and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps, produced by combining automated generation pipelines with human curation. With these high-quality samples, we further develop an evaluation pipeline that systematically queries video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
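
The evaluation pipeline can be pictured as repeatedly asking the model questions whose visual input is cut off at the query timestamp. Below is a minimal sketch in Python, assuming a hypothetical answer_fn that truncates the video before answering; it illustrates the idea and is not the released evaluation code.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TimedQuery:
    timestamp: float  # second of the stream at which the question is raised
    question: str
    answer: str       # ground-truth answer used for scoring

def evaluate_online(
    video_path: str,
    queries: List[TimedQuery],
    answer_fn: Callable[[str, float, str], str],  # hypothetical: (video, end_time, question) -> reply
) -> float:
    """Query the model along the timeline; it may only see frames up to each timestamp."""
    correct = 0
    for q in sorted(queries, key=lambda x: x.timestamp):
        # answer_fn is expected to truncate the video at q.timestamp before answering.
        prediction = answer_fn(video_path, q.timestamp, q.question)
        correct += int(prediction.strip().lower() == q.answer.strip().lower())
    return correct / len(queries) if queries else 0.0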

Task Taxonomy of OVO-Bench


Online video understanding aims to equip real-world, always-on agents with the ability to receive and process video inputs continuously. OVO-Bench closely mimics the human visual understanding process, which unfolds in the three modes below.

The tasks under each mode are as follows (an illustrative annotation sketch follows the list):
  1. Backward Tracing
    • [EPM] Episodic Memory: Backtrack and retrieve key moments from past video inputs.
    • [ASI] Action Sequence Identification: Identify the correct sequence of human actions in the video streams.
    • [HLD] Hallucination Detection: Ask questions irrelevant to existing video inputs.
  2. Real-Time Visual Perception
    • [STU] Spatial Understanding: Reason over the spatial relationships between objects occurring in nearby frames.
    • [OJR] Object Recognition: Recognize the objects appearing in the current frames.
    • [ATR] Attribute Recognition: Identify the characteristics or properties of objects.
    • [ACR] Action Recognition: Recognize and interpret the actions being performed by individuals in current frames.
    • [OCR] Optical Character Recognition: Recognize and interpret characters that appear within the frame.
    • [FTP] Future Prediction: Forecast the most probable subsequent phase of the current scene.
  3. Forward Active Responding
    • [REC] Repetition Event Count: Respond when a repetitive event occurs again.
    • [SSR] Sequential Steps Recognition: Respond when a certain procedure or sequence of actions has transitioned to another stage.
    • [CRR] Clues Reveal Responding: Delay responding until sufficient information or clues are provided.
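
To make the three modes concrete, here is what a meta-annotation entry could look like. The field names, file names, options, and timestamps below are invented for illustration only and do not reflect the released annotation schema.

# Illustrative only: one assumed meta-annotation per mode.
example_annotations = [
    {   # Backward Tracing (EPM): the answer lies in frames before the query time.
        "task": "EPM", "video": "ego_0001.mp4", "query_timestamp": 185.0,
        "question": "Which drawer did I put the scissors in earlier?",
        "options": ["A. Top drawer", "B. Middle drawer", "C. Bottom drawer", "D. None of the above"],
        "answer": "A",
    },
    {   # Real-Time Visual Perception (OJR): asks about what is on screen right now.
        "task": "OJR", "video": "cook_0042.mp4", "query_timestamp": 63.5,
        "question": "What object is the person holding right now?",
        "options": ["A. Knife", "B. Spatula", "C. Whisk", "D. Tongs"],
        "answer": "B",
    },
    {   # Forward Active Responding (REC): the model should wait and respond only
        # after the repetitive event has occurred the required number of times.
        "task": "REC", "video": "gym_0007.mp4", "query_timestamp": 12.0,
        "question": "Notify me when the person finishes the third push-up.",
        "respond_timestamp": 27.4,
    },
]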

OVO-Bench Statistics

OVO-Bench is a benchmark designed for online visual understanding. (a) For each video, we densely query Video-LLMs along the video stream to simulate online conversation scenarios. (b) Our benchmark features a wide range of query timestamps and video durations. (c) Egocentric videos make up a large proportion of our data sources.
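
For the forward active responding tasks in particular, dense querying amounts to re-posing the same question at later and later timestamps until the model commits to an answer. A rough sketch under the same assumptions as above (a hypothetical answer_fn and a "not yet" deferral convention that is not the official protocol):

from typing import Callable, Optional, Tuple

def probe_until_answer(
    video_path: str,
    question: str,
    start_time: float,
    video_duration: float,
    answer_fn: Callable[[str, float, str], str],  # hypothetical: (video, end_time, question) -> reply
    stride: float = 1.0,
) -> Tuple[Optional[str], float]:
    """Return the first non-deferred answer and the timestamp at which it was given."""
    t = start_time
    while t <= video_duration:
        reply = answer_fn(video_path, t, question)  # model sees frames up to time t only
        if reply.strip().lower() != "not yet":      # model chooses to respond
            return reply, t
        t += stride
    return None, video_duration  # the model never committed to an answer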

OVO-Bench Leaderboard


The performance of the latest mainstream Video-LLMs on our benchmark, including two closed-source models and six representative open-source models.
The best results in each category are highlighted in bold.

More Dataset Examples

BibTeX


@misc{li2025ovobenchfarvideollmsrealworld,
	title={OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?}, 
	author={Yifei Li and Junbo Niu and Ziyang Miao and Chunjiang Ge and Yuanhang Zhou and Qihao He and Xiaoyi Dong and Haodong Duan and Shuangrui Ding and Rui Qian and Pan Zhang and Yuhang Zang and Yuhang Cao and Conghui He and Jiaqi Wang},
	year={2025},
	eprint={2501.05510},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2501.05510}, 
}