AesRM: Improving Video Aesthetics with Expert-Level Feedback
Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li + 5 more
TLDR
AesRM is a family of video aesthetic reward models trained on expert-annotated preferences under a hierarchical rubric; it outperforms baseline reward models and improves the aesthetics of generated videos.
Key contributions
- Proposes a hierarchical rubric that decomposes video aesthetics into three dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), with 15 fine-grained criteria.
- Creates a large-scale expert-annotated preference dataset and AesVideo-Bench, an evaluation benchmark of about 2,500 video pairs annotated on VA, VF, and VP.
- Develops AesRM, a family of Video Aesthetic Reward Models, including AesRM-CoT for interpretable assessments.
- Trains AesRM with a three-stage progressive scheme and self-consistency-based CoT synthesis.
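The core idea behind AesRM-Base, predicting a pairwise preference per rubric dimension, can be sketched as a Bradley-Terry style comparison. This is a minimal illustration, not the paper's implementation: the per-dimension scores here are stand-in numbers, whereas the actual model is a learned video reward model.

```python
import math

# The three rubric dimensions from the paper's hierarchical rubric.
DIMENSIONS = ("VA", "VF", "VP")

def pairwise_preference(scores_a: dict, scores_b: dict) -> dict:
    """Turn per-dimension scalar scores for two videos into
    P(A preferred over B) per dimension, via a Bradley-Terry
    style sigmoid on the score difference."""
    prefs = {}
    for dim in DIMENSIONS:
        diff = scores_a[dim] - scores_b[dim]
        prefs[dim] = 1.0 / (1.0 + math.exp(-diff))
    return prefs

# Usage: video A scores higher on Visual Aesthetics, lower on
# Visual Fidelity, and ties on Visual Plausibility.
a = {"VA": 1.2, "VF": 0.3, "VP": 0.8}
b = {"VA": 0.4, "VF": 0.9, "VP": 0.8}
prefs = pairwise_preference(a, b)
```

A dimension-wise output like this is what lets the reward model supply separate post-training rewards for VA, VF, and VP rather than a single scalar.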
Why it matters
This paper addresses the gap in rigorous video aesthetic evaluation, moving beyond simple visual pleasure. By providing a detailed rubric and expert-annotated data, AesRM offers a robust and interpretable method to improve video generation quality for real-world applications like filmmaking.
Original Abstract
Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. Prior work on visual aesthetics largely focuses on images, often reducing aesthetics to coarse definitions, e.g., visual pleasure, without a rigorous and systematic evaluation. To improve video aesthetics, we propose a hierarchical rubric that decomposes video aesthetics into three core dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), with 15 fine-grained criteria, e.g., shot composition. This framework enables a large-scale expert-annotated preference dataset and an evaluation benchmark, AesVideo-Bench, containing about 2500 video pairs with expert annotations on VA, VF, and VP. We then build a family of Video Aesthetic Reward Models (AesRM): AesRM-Base, which directly predicts pairwise preferences on these dimensions to provide efficient post-training rewards, and AesRM-CoT, which additionally generates CoT aligned with all 15 criteria to improve assessment interpretability. Specifically, we train AesRM with a three-stage progressive scheme: (1) Atomic Aesthetic Capability Learning, which strengthens AesRM's recognition of fundamental aesthetic concepts, e.g., accurately identifying centered composition; (2) Cold-Start, aligning the model with structured reasoning protocols; and (3) GRPO, further improving evaluation accuracy. To enhance AesRM-CoT, we additionally propose self-consistency-based CoT synthesis to improve CoT quality and design CoT-based process rewards during GRPO. Extensive experiments show AesRM outperforms baselines on multiple aesthetics benchmarks and is more robust, with lower position bias. Finally, we align Wan2.2 with AesRM and observe clear aesthetic gains over existing aesthetic reward models.
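The self-consistency-based CoT synthesis mentioned above can be sketched as follows: sample several chain-of-thought traces for the same video pair and keep only those whose final verdict agrees with the majority. This is a hypothetical simplification; the trace format and selection rule are assumptions, not the paper's exact procedure.

```python
from collections import Counter

def select_consistent_cots(traces: list[tuple[str, str]]) -> list[str]:
    """traces: (cot_text, verdict) pairs, where verdict is 'A' or 'B'
    (which video the trace prefers). Returns the CoT texts whose
    verdict matches the majority verdict across all samples."""
    verdicts = [verdict for _, verdict in traces]
    majority, _ = Counter(verdicts).most_common(1)[0]
    return [cot for cot, verdict in traces if verdict == majority]

# Usage: three sampled traces for one video pair; two prefer A, one prefers B,
# so only the two majority-consistent traces are kept as training CoTs.
samples = [
    ("A has more cinematic lighting, so A is preferred.", "A"),
    ("A's centered composition is stronger, so A is preferred.", "A"),
    ("B looks slightly sharper, so B is preferred.", "B"),
]
kept = select_consistent_cots(samples)
```

Filtering traces by agreement with a majority verdict is one common way to raise CoT quality before cold-start training, which matches the role the abstract assigns to this step.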