OutDreamer: Video Outpainting with a Diffusion Transformer

Linhao Zhong¹,*, Fan Li²,⁴,*,‡, Yi Huang³, Jianzhuang Liu³, Renjing Pei², Fenglong Song²

¹ Zhejiang University, China

² Huawei Noah's Ark Lab, China

³ Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China

⁴ Nankai University, China

* Equal contribution

‡ Corresponding author


Abstract

Video outpainting is a challenging task that generates new video content by extending beyond the boundaries of an original input video, requiring both temporal and spatial consistency. Many state-of-the-art methods utilize latent diffusion models with U-Net backbones but still struggle to achieve high quality and adaptability in generated content. Diffusion transformers (DiTs) have emerged as a promising alternative because of their superior performance. We introduce OutDreamer, a DiT-based video outpainting framework comprising two main components: an efficient video control branch and a conditional outpainting branch. The efficient video control branch effectively extracts masked video information, while the conditional outpainting branch generates missing content based on these extracted conditions. Additionally, we propose a mask-driven self-attention layer that dynamically integrates the given mask information, further enhancing the model's adaptability to outpainting tasks. Furthermore, we introduce a latent alignment loss to maintain overall consistency both within and between frames. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content, ensuring temporal consistency across video clips. Extensive evaluations demonstrate that our zero-shot OutDreamer outperforms state-of-the-art zero-shot methods on widely recognized benchmarks.
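To make the "masked video" input mentioned in the abstract concrete, here is a minimal sketch of how an outpainting input is commonly prepared: each original frame is placed on a larger canvas, and a binary mask marks the region that must be generated. This is an illustration only, not the authors' code. The function names (`pad_frame_with_mask`, `mask_video`), the nested-list frame representation, the zero fill value, and the convention that 1 means "to be generated" are all assumptions made for this example.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame as rows of pixel values (assumed layout)
Mask = List[List[int]]     # assumed convention: 1 = generate, 0 = known pixel


def pad_frame_with_mask(
    frame: Frame, left: int, right: int, top: int = 0, bottom: int = 0,
    fill: float = 0.0,
) -> Tuple[Frame, Mask]:
    """Place `frame` on a larger canvas and return (canvas, mask)."""
    h, w = len(frame), len(frame[0])
    out_h, out_w = h + top + bottom, w + left + right
    # Start with an empty canvas in which everything is marked as unknown.
    canvas = [[fill] * out_w for _ in range(out_h)]
    mask = [[1] * out_w for _ in range(out_h)]
    # Copy the original pixels in and mark them as known.
    for y in range(h):
        for x in range(w):
            canvas[top + y][left + x] = frame[y][x]
            mask[top + y][left + x] = 0
    return canvas, mask


def mask_video(
    video: List[Frame], left: int, right: int, top: int = 0, bottom: int = 0,
) -> Tuple[List[Frame], List[Mask]]:
    """Apply the same padding to every frame so the mask is constant over time."""
    pairs = [pad_frame_with_mask(f, left, right, top, bottom) for f in video]
    return [c for c, _ in pairs], [m for _, m in pairs]


if __name__ == "__main__":
    # Two tiny 2x2 frames, extended by one column on the left and two on the right.
    video = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    canvases, masks = mask_video(video, left=1, right=2)
    print(canvases[0])
    print(masks[0])
```

In a setup like the one the abstract describes, the padded frames and the mask would be the conditions passed to the control branch, while the generative branch fills in the region where the mask is 1. How OutDreamer encodes and injects these conditions is not specified on this page.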

Results

Short Video Outpainting Results

fps=8

[Video gallery: seven pairs, each showing an input video and its outpainting result]

Long Video Outpainting Results

fps=20

[Video gallery: three pairs, each showing an input video and its outpainting result]

Citation

If you find our work useful, please consider citing:

@article{zhong2026outdreamer,
  title={{OutDreamer}: Video outpainting with a diffusion transformer},
  author={Zhong, Linhao and Li, Fan and Huang, Yi and Liu, Jianzhuang and Pei, Renjing and Song, Fenglong},
  journal={IEEE Transactions on Image Processing},
  year={2026},
  publisher={IEEE}
}