Skyra

AI-Generated Video Detection via Grounded Artifact Reasoning

Department of Automation, Tsinghua University

Skyra identifies human-perceivable visual artifacts (e.g., shape distortion, physics violations) in AI-generated videos and leverages them as grounded evidence for both detection and explanation.

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, providing valuable insights for advancing explainable AI-generated video detection.

Fine-Grained Artifact Taxonomy & Dataset

Artifact Taxonomy

We propose a hierarchical artifact taxonomy (top level: Low-level Forgery & Violation of Laws) with fine-grained categories such as Physics Violation, Object Inconsistency, and Texture Anomaly to support fine-grained reasoning.


ViF-CoT-4K Dataset

Dataset Statistics

ViF-CoT-4K is the first large-scale dataset with manual fine-grained artifact annotations. It includes:

  • Diverse Generators: Sora-2, Wan2.1, Kling, CogVideoX, etc.
  • Grounded CoT: Step-by-step reasoning chains with timestamps and bounding boxes.
  • High Quality: Reduced real-fake discrepancies to prevent shortcut learning.
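To make the "Grounded CoT" idea concrete, a single annotation pairs each artifact with its category, a timestamp span, and a bounding box. The sketch below is purely illustrative: the field names and values are hypothetical, not the actual ViF-CoT-4K schema.

```python
# Hypothetical sketch of a grounded artifact annotation record.
# Field names and values are illustrative, NOT the real ViF-CoT-4K schema.
annotation = {
    "video_id": "example_0001",
    "label": "fake",
    "generator": "CogVideoX",
    "artifacts": [
        {
            "category": "Physics Violation",
            "time_span": [1.2, 2.8],           # seconds within the clip
            "bbox": [0.31, 0.40, 0.58, 0.77],  # normalized [x1, y1, x2, y2]
            "reasoning": "The cup floats upward with no supporting force.",
        }
    ],
}

def bbox_to_pixels(bbox, width, height):
    """Convert a normalized [x1, y1, x2, y2] box to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))
```

For example, `bbox_to_pixels(annotation["artifacts"][0]["bbox"], 1280, 720)` maps the normalized box onto a 1280x720 frame for visualization.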

Skyra Framework: From SFT to RL


We employ a two-stage training pipeline:

  1. Cold-Start Initialization (SFT): Using ViF-CoT-4K to endow the model with basic perception of artifacts and grounded reasoning capabilities.
  2. Reinforcement Learning (RL): Using Group Relative Policy Optimization (GRPO) with a custom Asymmetric Reward mechanism. This encourages the model to actively explore artifacts (high penalty for missing fakes) while maintaining strict format verification.
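The asymmetric reward described above can be sketched as a small scoring function: a missed fake is penalized more heavily than a false alarm, and malformed outputs are rejected outright. The specific penalty magnitudes below are assumptions for illustration, not the paper's reported hyperparameters.

```python
def asymmetric_reward(pred_label, true_label, output_is_well_formed,
                      miss_penalty=-2.0, false_alarm_penalty=-1.0,
                      correct_reward=1.0, format_penalty=-1.0):
    """Sketch of an asymmetric reward for GRPO-style RL.

    Predicting 'real' on a fake video (a miss) is penalized more heavily
    than flagging a real video as fake (a false alarm), encouraging the
    model to actively explore artifacts. Malformed outputs fail the
    format check regardless of the label. All magnitudes here are
    illustrative assumptions.
    """
    if not output_is_well_formed:
        return format_penalty            # strict format verification
    if pred_label == true_label:
        return correct_reward
    if true_label == "fake":
        return miss_penalty              # missed a fake: heavier penalty
    return false_alarm_penalty           # called a real video fake
```

In GRPO, such per-sample rewards are compared across a group of sampled responses, so only the relative ordering induced by the asymmetry matters, not the absolute values.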

Experimental Results

Radar chart: performance comparison on ViF-Bench across various generators.

Skyra achieves state-of-the-art performance on the proposed ViF-Bench and the external GenVideo benchmark.

Method             Type         Acc (%)   F1 Score
DeMamba            Binary       64.29     73.00
GPT-4.1-mini       MLLM         54.08     24.21
Gemini-2.5-flash   MLLM         53.36     57.48
BusterX++          MLLM-based   56.90     21.94
Skyra (Ours-RL)    MLLM-based   91.02     90.27

Comparison on ViF-Bench (Mean). Skyra significantly outperforms both binary detectors and generic MLLMs.
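For reference, the accuracy and F1 numbers in the table follow the standard binary definitions; the sketch below assumes the generated ("fake") class is treated as the positive class, which is the common convention for this task.

```python
def binary_metrics(preds, labels, positive="fake"):
    """Accuracy and F1 score, treating `positive` as the positive class.

    Standard definitions only; assumes 'fake' is the positive class,
    which is an assumption about the benchmark's convention.
    """
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, f1
```

The gap between GPT-4.1-mini's accuracy (54.08) and F1 (24.21) is the signature of a model that rarely predicts "fake": recall on the positive class collapses even while chance-level accuracy is preserved.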

Grounded Reasoning Examples

Skyra provides detailed textual explanations and spatial-temporal grounding (bounding boxes & timestamps) for its decisions.

BibTeX

@misc{li2025skyraaigeneratedvideodetection,
      title={Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning}, 
      author={Yifei Li and Wenzhao Zheng and Yanran Zhang and Runze Sun and Yu Zheng and Lei Chen and Jie Zhou and Jiwen Lu},
      year={2025},
      eprint={2512.15693},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.15693}, 
}