Skyra: AI-Generated Video Detection

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, providing valuable insights for advancing explainable AI-generated video detection.

Fine-Grained Artifact Taxonomy & Dataset

We propose a hierarchical taxonomy whose top level (L1) covers Low-level Forgery and Violation of Laws, with finer-grained categories such as Physics Violation, Object Inconsistency, and Texture Anomaly, to support fine-grained reasoning.

ViF-CoT-4K Dataset

ViF-CoT-4K is the first large-scale dataset with manual fine-grained artifact annotations. It includes:

Diverse Generators: Sora-2, Wan2.1, Kling, CogVideoX, etc.
Grounded CoT: Step-by-step reasoning chains with timestamps and bounding boxes.
High Quality: Reduced real-fake discrepancies to prevent shortcut learning.
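To make the idea of a grounded reasoning chain concrete, here is a minimal sketch of how one annotated sample could be represented. The schema is hypothetical: the class names, field names, normalized box convention, and the example values are assumptions for illustration, not the released ViF-CoT-4K format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ArtifactEvidence:
    """One reasoning step tied to a time span and an image region (hypothetical schema)."""
    category: str                            # taxonomy category, e.g. "Physics Violation"
    start_s: float                           # start of the artifact, in seconds
    end_s: float                             # end of the artifact, in seconds
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed normalized to [0, 1]
    description: str                         # free-text explanation of what looks wrong

    def is_valid(self) -> bool:
        # The time span must be ordered and the box must be a non-empty region.
        x1, y1, x2, y2 = self.box
        return (0.0 <= self.start_s <= self.end_s
                and 0.0 <= x1 < x2 <= 1.0
                and 0.0 <= y1 < y2 <= 1.0)


@dataclass
class VideoAnnotation:
    video_id: str
    generator: str                           # e.g. "Wan2.1"; "real" for real footage (assumption)
    label: str                               # "fake" or "real"
    steps: List[ArtifactEvidence] = field(default_factory=list)

    def is_consistent(self) -> bool:
        # Assumption: real videos carry no artifact evidence, while fake
        # videos carry at least one valid, grounded step.
        if self.label == "real":
            return not self.steps
        return bool(self.steps) and all(s.is_valid() for s in self.steps)


# Illustrative values only; not taken from the dataset.
example = VideoAnnotation(
    video_id="demo_0001",
    generator="Wan2.1",
    label="fake",
    steps=[
        ArtifactEvidence(
            category="Physics Violation",
            start_s=1.0,
            end_s=2.5,
            box=(0.10, 0.20, 0.45, 0.60),
            description="Poured water rises back toward the glass.",
        )
    ],
)
real_example = VideoAnnotation(video_id="demo_0002", generator="real", label="real")
```

Keeping the timestamp and box next to each reasoning step is what lets an explanation be checked against the frames it refers to.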

Skyra Framework: From SFT to RL

We employ a two-stage training pipeline:

Cold-Start Initialization (SFT): Using ViF-CoT-4K to endow the model with basic perception of artifacts and grounded reasoning capabilities.
Reinforcement Learning (RL): Using Group Relative Policy Optimization (GRPO) with a custom Asymmetric Reward mechanism. This encourages the model to actively explore artifacts (high penalty for missing fakes) while maintaining strict format verification.
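The RL stage can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `<think>`/`<answer>` response format, the function names, and every reward value are hypothetical. Only two ideas come from the description above: a missed fake is penalized more heavily than a false alarm, and malformed responses fail format verification. The advantage computation follows the standard GRPO recipe of normalizing each reward by the mean and standard deviation of its sampled group.

```python
import re
import statistics
from typing import List

# Hypothetical response format: a reasoning block followed by a final verdict.
FORMAT_RE = re.compile(
    r"<think>.+</think>\s*<answer>(real|fake)</answer>", re.DOTALL
)


def asymmetric_reward(
    response: str,
    label: str,
    correct: float = 1.0,
    false_alarm_penalty: float = -0.5,   # real video called fake (illustrative value)
    miss_penalty: float = -1.0,          # fake video called real (illustrative value)
    format_penalty: float = -1.5,        # response fails format verification
) -> float:
    """Score one sampled response against the ground-truth label ("real" or "fake")."""
    match = FORMAT_RE.fullmatch(response.strip())
    if match is None:
        return format_penalty
    prediction = match.group(1)
    if prediction == label:
        return correct
    # Asymmetry: missing a fake costs more than a false alarm on a real video.
    return miss_penalty if label == "fake" else false_alarm_penalty


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same video."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# One group of sampled responses for a single fake video.
responses = [
    "<think>The hand shows six fingers around 1.2s.</think><answer>fake</answer>",
    "<think>Motion looks natural throughout.</think><answer>real</answer>",
    "fake",  # no reasoning block, so it fails format verification
]
rewards = [asymmetric_reward(r, "fake") for r in responses]
advantages = group_relative_advantages(rewards)
```

In this sketch the malformed response receives the lowest reward, so the policy cannot escape the miss penalty by breaking the format.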

Experimental Results

Performance comparison on ViF-Bench across various generators.

Skyra achieves state-of-the-art performance on the proposed ViF-Bench and the external GenVideo benchmark.

Method	Type	Acc (%)	F1 (%)
DeMamba	Binary	64.29	73.00
GPT-4.1-mini	MLLM	54.08	24.21
Gemini-2.5-flash	MLLM	53.36	57.48
BusterX++	MLLM-based	56.90	21.94
Skyra (Ours-RL)	MLLM-based	91.02	90.27

Comparison on ViF-Bench (Mean). Skyra significantly outperforms both binary detectors and generic MLLMs.
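For readers comparing the two metric columns, the sketch below shows how accuracy and F1 can diverge. It assumes that "fake" is the positive class and that the test set is balanced; neither assumption is stated on this page, and the labels used here are synthetic.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) with 1 = fake (positive class) and 0 = real."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Synthetic example: 10 fake and 10 real videos, and a detector that
# labels almost everything as real (it catches only 1 of the 10 fakes).
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 1 + [0] * 9 + [0] * 10
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")
```

Under these assumptions, a detector that rarely flags fakes stays near chance accuracy while its F1 collapses, which is one possible reading of rows where accuracy is close to 50% but F1 is much lower.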

BibTeX

@article{li2025skyra,
  title={Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning},
  author={Li, Yifei and Zheng, Wenzhao and Zhang, Yanran and Sun, Runze and Zheng, Yu and Chen, Lei and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.15693},
  year={2025}
}



Grounded Reasoning Examples

Detection of Shape Distortion (Hands)

Violation of Physical Laws (Fluid Dynamics)

Abnormal Object Disappearance

Reasoning on Real Videos (No Artifacts)
