This webpage contains multiple ultra-high-definition videos, which may load slowly; please be patient. If loading fails, you can download the videos manually from this [link].
We present Vivid-VR, a generative video restoration method built upon an advanced DiT-based T2V model (i.e., CogVideoX1.5-5B), where ControlNet is leveraged to control the generation process and ensure content consistency. However, limited by the diversity of training data and imperfect multimodal alignment, fine-tuning the controllable generation pipeline often leads to drift, degrading texture realism and temporal coherence in the output. To tackle this challenge, we propose a concept distillation training strategy that leverages the T2V base model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. For more effective control, we design an adaptive feature modulation mechanism, comprising a control feature projector and a new ControlNet connector, which not only enables precise control but also filters out low-quality input features to mitigate their adverse effects. Furthermore, we introduce an aggregation sampling strategy based on mutually exclusive fusion to resolve boundary artifacts in DiT's aggregated sampling, enabling high-resolution video inference with controllable memory usage. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as on AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency.
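The aggregation sampling idea can be illustrated with a small sketch: tiles are read with overlap so that each tile sees enough spatial context, but the write regions are chosen to partition the latent, so every position is produced by exactly one tile and no blending happens at seams. The tile/stride values, the `denoise_tile` callable, and the restriction to spatial tiling below are assumptions for illustration, not the exact Vivid-VR implementation.

```python
# Illustrative sketch: overlapping reads, mutually exclusive writes.
import torch


def tile_spans(length: int, tile: int, stride: int):
    """Overlapping read spans plus a non-overlapping partition of write spans."""
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:  # make sure the last tile reaches the border
        starts.append(length - tile)
    reads = [(s, min(s + tile, length)) for s in starts]
    # Write boundaries sit at the midpoint of each overlap, so the write spans
    # partition [0, length) and neighbouring tiles never blend their outputs.
    bounds = [0] + [(reads[k - 1][1] + reads[k][0]) // 2 for k in range(1, len(reads))] + [length]
    writes = list(zip(bounds[:-1], bounds[1:]))
    return reads, writes


@torch.no_grad()
def aggregate_denoise(latent: torch.Tensor, denoise_tile, tile: int = 96, stride: int = 64):
    """latent: (B, C, T, H, W) video latent; denoise_tile runs the model on one spatial tile."""
    _, _, _, H, W = latent.shape
    out = torch.empty_like(latent)
    h_reads, h_writes = tile_spans(H, tile, stride)
    w_reads, w_writes = tile_spans(W, tile, stride)
    for hr, hw in zip(h_reads, h_writes):
        for wr, ww in zip(w_reads, w_writes):
            d = denoise_tile(latent[..., hr[0]:hr[1], wr[0]:wr[1]])  # overlapping context
            out[..., hw[0]:hw[1], ww[0]:ww[1]] = d[..., hw[0] - hr[0]:hw[1] - hr[0],
                                                    ww[0] - wr[0]:ww[1] - wr[0]]
    return out
```

Because the write spans form a partition, each tile's output is copied into a region it fully owns, which avoids the weighted-blending seams that plain overlapped aggregation produces while keeping per-tile memory bounded.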
Given a low-quality input video, Vivid-VR first generates a text description with CogVLM2-Video and encodes it into text tokens via the T5 encoder. Simultaneously, the input video is encoded into a video latent by the 3D VAE encoder, which is further processed by the proposed control feature projector to filter out low-quality features. The video latent is then patchified and noised to produce visual tokens. The text tokens, visual tokens, and timestep are fed into the DiT and the ControlNet. To improve controllability, we design a two-branch connector: one branch uses an MLP for feature mapping, while the other employs cross-attention to dynamically retrieve the most relevant control features, enabling adaptive alignment with the input video during generation. After T denoising steps, the denoised latent is decoded by the 3D VAE to produce the final high-quality video. The control feature projector, ControlNet, and connectors are trained with the proposed concept distillation training strategy, while the remaining parameters are kept frozen.
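For concreteness, below is a minimal PyTorch sketch of such a two-branch connector. The module name, hidden sizes, normalization layers, and zero-initialized output projection are illustrative assumptions rather than the released Vivid-VR code.

```python
# Minimal sketch of a two-branch ControlNet-to-DiT connector (illustrative only).
import torch
import torch.nn as nn


class TwoBranchConnector(nn.Module):
    """Fuses ControlNet features into the DiT stream via two branches:
    an MLP mapping branch and a cross-attention retrieval branch."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Branch 1: plain feature mapping.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )
        # Branch 2: DiT tokens query the most relevant control features.
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized output projection so the connector starts as a no-op.
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, dit_tokens: torch.Tensor, ctrl_tokens: torch.Tensor) -> torch.Tensor:
        # dit_tokens:  (B, N, C) visual tokens inside a DiT block
        # ctrl_tokens: (B, N, C) features from the matching ControlNet block
        mapped = self.mlp(ctrl_tokens)                      # branch 1: feature mapping
        retrieved, _ = self.attn(self.norm_q(dit_tokens),   # branch 2: retrieval by cross-attention
                                 self.norm_kv(ctrl_tokens),
                                 self.norm_kv(ctrl_tokens))
        return dit_tokens + self.out(mapped + retrieved)
```

The cross-attention branch lets each visual token weight the control features it actually needs, while the zero-initialized projection keeps the pretrained DiT behavior intact at the start of fine-tuning.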
To better evaluate the proposed method, we build two test sets, named UGC50 and AIGC50, which contain 50 real-world UGC videos and 50 AIGC videos, respectively. You can download both test sets from this [link].
For the synthetic test sets, we employ full-reference metrics (PSNR, SSIM, LPIPS, and DISTS) and no-reference metrics (NIQE, CLIP-IQA, MUSIQ, DOVER, and MD-VQA). For the real-world and AIGC test sets, we adopt only no-reference metrics due to the absence of ground truth. Experimental results show that Vivid-VR achieves the best or near-best scores on the no-reference metrics.
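As a rough illustration of how such frame-level metrics can be computed, the sketch below assumes the pyiqa (IQA-PyTorch) package; the metric identifiers and per-frame averaging are assumptions, and the video-level metrics (DOVER, MD-VQA) are evaluated with their own official codebases rather than through pyiqa.

```python
# Minimal sketch of per-frame metric computation, assuming the pyiqa package.
# DOVER and MD-VQA are video-level metrics and are not covered here.
from typing import Optional

import torch
import pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
full_ref = {name: pyiqa.create_metric(name, device=device)
            for name in ("psnr", "ssim", "lpips", "dists")}
no_ref = {name: pyiqa.create_metric(name, device=device)
          for name in ("niqe", "clipiqa", "musiq")}


def score_video(frames: torch.Tensor, gt: Optional[torch.Tensor] = None) -> dict:
    """frames / gt: (N, 3, H, W) tensors in [0, 1]; returns per-metric means over frames."""
    scores = {name: m(frames).mean().item() for name, m in no_ref.items()}
    if gt is not None:  # full-reference metrics need ground-truth frames
        scores.update({name: m(frames, gt).mean().item() for name, m in full_ref.items()})
    return scores
```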
@misc{bai2025vividvr,
  title={Vivid-VR: Distilling Concepts from Diffusion Transformer for Photorealistic Video Restoration},
  author={Haoran Bai and Xiaoxu Chen and Canqian Yang and Zongyao He and Sibin Deng and Ying Chen},
  year={2025},
  url={https://github.com/csbhr/Vivid-VR}
}