OVBench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Published in arXiv preprint, 2024

Integrating past information and adapting to continuous video input are pivotal for human-level video understanding. Current benchmarks, however, focus on coarse-grained, video-level question answering in offline settings, limiting real-time processing and adaptability for practical applications. To this end, we introduce OVBench (Online-Video-Benchmark), which assesses online video understanding through three modes: (1) Backward Tracing, (2) Real-Time Visual Perception, and (3) Forward Active Responding. OVBench consists of 12 tasks comprising about 2,800 meta-annotations with fine-grained, event-level timestamps, paired with 858 videos across 10 domains, including egocentric activities, virtual gaming worlds, and cinematic scenes. To minimize bias, we combine automated generation pipelines with meticulous human annotation. Building on these high-quality samples, we design an effective problem generation and evaluation pipeline that densely queries Video-LLMs across the video streaming timeline. Extensive evaluations of nine Video-LLMs reveal that, despite rapid advancements and improving performance on traditional benchmarks, existing models struggle with online video understanding, with the best-performing models still far behind human agents. We anticipate that OVBench will guide the development of Video-LLMs towards practical real-world applications and inspire future research in online video understanding. Our benchmark and code are available at https://github.com/JoeLeelyf/OVBench.
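
To make the online evaluation setting concrete, below is a minimal sketch of what a timestamp-anchored, dense querying loop might look like. It is illustrative only: the `StreamingQuery` dataclass, the `VideoLLM.answer` interface, the `evaluate_online` helper, and the exact-match scoring are all assumptions for this example, not the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class StreamingQuery:
    """One OVBench-style query anchored to a timestamp on the video timeline (illustrative)."""
    question: str
    query_time: float  # seconds into the stream at which the query is issued
    mode: str          # "backward_tracing" | "real_time_perception" | "forward_active_responding"
    answer: str        # ground-truth answer from the meta-annotation


class VideoLLM(Protocol):
    """Hypothetical interface for the Video-LLM under test (not a real library API)."""
    def answer(self, frames: list, question: str) -> str: ...


def evaluate_online(model: VideoLLM, frames: list, fps: float,
                    queries: List[StreamingQuery]) -> float:
    """Densely query the model along the streaming timeline.

    At each query timestamp the model only sees frames observed so far,
    mimicking an online setting rather than offline whole-video QA.
    """
    correct = 0
    for q in sorted(queries, key=lambda q: q.query_time):
        # Online constraint: truncate the stream at the query time.
        visible = frames[: int(q.query_time * fps)]
        prediction = model.answer(visible, q.question)
        # Simple exact-match scoring for the sketch; the real benchmark
        # may score differently (e.g. multiple-choice accuracy).
        correct += int(prediction.strip().lower() == q.answer.strip().lower())
    return correct / max(len(queries), 1)
```

The key design point the sketch tries to capture is the online constraint: unlike offline video QA, the model never receives frames beyond the query timestamp, so backward tracing, real-time perception, and forward active responding all operate on the same growing prefix of the stream.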

Recommended citation: "OVBench: How Far is Your Video-LLMs from Real-World Online Video Understanding?" arXiv preprint, 2024.
Download Paper