arxiv:2602.05986

RISE-Video: Can Video Generators Decode Implicit World Rules?

Published on Feb 5 · Submitted by Xue Yang on Feb 6

Abstract

RISE-Video presents a novel benchmark for evaluating text-image-to-video synthesis models based on cognitive reasoning rather than visual fidelity, using a multi-dimensional metric system and automated LMM-based evaluation.

AI-generated summary

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
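
To make the evaluation protocol above more concrete, here is a minimal, hypothetical sketch of how an LMM-as-judge pass over the four named dimensions could be organized and averaged into a per-sample score. The prompt wording, the 1-5 scale, and the `judge` callable are illustrative assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict

# The four evaluation dimensions named in the RISE-Video protocol.
DIMENSIONS = [
    "Reasoning Alignment",
    "Temporal Consistency",
    "Physical Rationality",
    "Visual Quality",
]

# Hypothetical judge prompt; the wording and scale used by the paper's
# automated pipeline are not specified here.
PROMPT_TEMPLATE = (
    "You are evaluating a generated video for the task: {task}\n"
    "Rate its {dimension} on a 1-5 scale and reply with a single integer."
)


@dataclass
class SampleScore:
    per_dimension: Dict[str, float]

    @property
    def overall(self) -> float:
        # Simple unweighted mean; the benchmark may weight dimensions differently.
        return mean(self.per_dimension.values())


def score_sample(task: str, video_path: str,
                 judge: Callable[[str, str], float]) -> SampleScore:
    """Query an LMM judge once per dimension and collect the scores.

    `judge(prompt, video_path)` is a placeholder for whatever multimodal
    API call is used to rate the generated video.
    """
    scores = {
        dim: judge(PROMPT_TEMPLATE.format(task=task, dimension=dim), video_path)
        for dim in DIMENSIONS
    }
    return SampleScore(per_dimension=scores)
```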

Community

Paper submitter

Despite strong visual realism, we find that current text-image-to-video models frequently fail to respect implicit world rules when generating complex scenarios. We introduce RISE-Video to systematically evaluate reasoning fidelity in video generation and reveal persistent reasoning gaps across state-of-the-art models.
Code: https://github.com/VisionXLab/Rise-Video
Data: https://huggingface.co/datasets/VisionXLab/RISE-Video
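
As a quick way to inspect the benchmark itself, something like the snippet below should pull the samples from the Hub via the dataset link above; the split and column names are assumptions to check against the dataset card.

```python
from datasets import load_dataset

# Assumes a default config and a "train" split; see the dataset card
# for the actual splits and column names.
ds = load_dataset("VisionXLab/RISE-Video", split="train")

print(ds)      # number of rows and column names
print(ds[0])   # inspect one human-annotated sample
```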

arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/rise-video-can-video-generators-decode-implicit-world-rules-4136-2f194534

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.05986 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.05986 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.05986 in a Space README.md to link it from this page.

Collections including this paper 2
