Official repository for the paper "CamI2V: Camera-Controlled Image-to-Video Diffusion Model".
Abstract: Recent advancements have integrated camera pose as a user-friendly and physics-informed condition in video diffusion models, enabling precise camera control. In this paper, we identify one of the key challenges as effectively modeling noisy cross-frame interactions to enhance geometry consistency and camera controllability. We innovatively associate the quality of a condition with its ability to reduce uncertainty and interpret noisy cross-frame features as a form of noisy condition. Recognizing that noisy conditions provide deterministic information while also introducing randomness and potential misguidance due to added noise, we propose applying epipolar attention to only aggregate features along corresponding epipolar lines, thereby accessing an optimal amount of noisy conditions. Additionally, we address scenarios where epipolar lines disappear, commonly caused by rapid camera movements, dynamic objects, or occlusions, ensuring robust performance in diverse environments. Furthermore, we develop a more robust and reproducible evaluation pipeline to address the inaccuracies and instabilities of existing camera control metrics. Our method achieves a 25.64% improvement in camera controllability on the RealEstate10K dataset without compromising dynamics or generation quality and demonstrates strong generalization to out-of-domain images. Training and inference require only 24GB and 12GB of memory, respectively, for 16-frame sequences at 256×256 resolution. We will release all checkpoints, along with training and evaluation code. Dynamic videos are available for viewing on our project page.
- 🔥 2025/01/02: Release checkpoint of CamI2V (512x320, 50k), which is suitable for research purposes and comparison. We plan to release a more advanced model with longer training soon.
- 🔥 2024/12/24: Integrate Qwen2-VL into the gradio demo. You can now caption your own input image with this powerful VLM.
- 🔥 2024/12/23: Release checkpoint of CamI2V (256x256, 50k).
- 🔥 2024/12/16: Release unofficially reproduced checkpoints of MotionCtrl (256x256, 50k) and CameraCtrl (256x256, 50k) on DynamiCrafter.
- 🔥 2024/12/09: Release training configs and scripts.
- 🔥 2024/12/06: Release dataset pre-process code for RealEstate10K.
- 🔥 2024/12/02: Release evaluation code for RotErr, TransErr, CamMC and FVD.
- 🌱 2024/11/16: Release model code of CamI2V for training and inference, including implementation for MotionCtrl and CameraCtrl.
Measured at 256x256 resolution, 16 frames, 25 steps.
Method | RotErr | TransErr | CamMC | FVD (VideoGPT) | FVD (StyleGAN) |
---|---|---|---|---|---|
DynamiCrafter | 3.3415 | 9.8024 | 11.625 | 106.02 | 92.196 |
+ MotionCtrl | 0.8636 | 2.5068 | 2.9536 | 70.820 | 60.363 |
+ Plucker Embedding (Baseline, CameraCtrl) | 0.7098 | 1.8877 | 2.2557 | 66.077 | 55.889 |
+ Plucker Embedding + Epipolar Attention Only on Reference Frame (CamCo-like) | 0.5738 | 1.6014 | 1.8851 | 66.439 | 56.778 |
+ Plucker Embedding + Epipolar Attention (Our CamI2V) | 0.4758 | 1.4955 | 1.7153 | 66.090 | 55.701 |
+ Plucker Embedding + 3D Full Attention | 0.6299 | 1.8215 | 2.1315 | 71.026 | 60.00 |
- Our method demonstrates significant improvements over CameraCtrl, achieving a 32.96% reduction in Rotation Error, a 25.64% decrease in CamMC, and a 20.77% improvement in Translation Error, without degrading FVD. These results were obtained with text and image CFG set to 7.5, 25 steps, and camera CFG set to 1.0 (no camera CFG).
- Compared with the CamCo-like method (arXiv in June), we improve RotErr, TransErr, and CamMC by 17.08%, 6.61%, and 9.00%, respectively, without degrading FVD.
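The percentages quoted for the CamCo-like comparison are relative reductions computed from the corresponding table rows. The following is a minimal sketch of that arithmetic, assuming the formula `(baseline - ours) / baseline`; the helper name `relative_reduction` is illustrative and not part of this repo. The results agree with the quoted figures to within rounding.

```python
# Values copied from the 256x256 results table (lower is better).
camco_like = {"RotErr": 0.5738, "TransErr": 1.6014, "CamMC": 1.8851}
cami2v = {"RotErr": 0.4758, "TransErr": 1.4955, "CamMC": 1.7153}


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (hypothetical helper)."""
    return 100.0 * (baseline - ours) / baseline


improvements = {
    name: relative_reduction(camco_like[name], cami2v[name]) for name in camco_like
}

for name, pct in improvements.items():
    print(f"{name}: {pct:.2f}% lower than the CamCo-like variant")
```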
Method | # Parameters | GPU Memory | Generation Time (RTX 3090) |
---|---|---|---|
DynamiCrafter | 1.4 B | 11.14 GiB | 8.14 s |
+ MotionCtrl | + 63.4 M | 11.18 GiB | 8.27 s |
+ Plucker Embedding (Baseline, CameraCtrl) | + 211 M | 11.56 GiB | 8.38 s |
+ Plucker Embedding + Epipolar Attention (Our CamI2V) | + 261 M | 11.67 GiB | 10.3 s |
conda create -n cami2v python=3.10
conda activate cami2v
conda install -y pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y xformers -c xformers
pip install -r requirements.txt
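After installation, it can help to confirm that the pinned versions actually landed in the environment. The sketch below is not part of the repo; it reads package metadata without importing the packages, and it assumes the conda `pytorch` package is registered under the distribution name `torch`. The function name `check_environment` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the conda commands above; xformers is installed unpinned.
EXPECTED = {"torch": "2.4.1", "torchvision": "0.19.1", "xformers": None}


def check_environment(expected=EXPECTED):
    """Return {package: (installed_version_or_None, ok)} from package metadata."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu121" are ignored when comparing.
        ok = installed is not None and (
            wanted is None or installed.split("+")[0] == wanted
        )
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "check"
        print(f"{name}: {installed or 'not installed'} [{status}]")
```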
Model | Resolution | Training Steps |
---|---|---|
CamI2V | 512x320 | 50k |
CamI2V | 256x256 | 50k |
CameraCtrl | 256x256 | 50k |
MotionCtrl | 256x256 | 50k |
Currently we release checkpoints of DynamiCrafter-based CamI2V (256x256, 512x320), CameraCtrl (256x256) and MotionCtrl (256x256), with 50k training steps.
Download the above checkpoints and put them under the `ckpts` folder.
Please edit `ckpt_path` in `configs/models.json` if you have a different model path.
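To spot stale paths before launching the demo, one option is to list every `ckpt_path` entry whose file is missing. This is a sketch only: the layout of `configs/models.json` is not described here, so the code assumes nothing beyond `ckpt_path` keys appearing somewhere in the JSON and searches at any nesting depth. Both function names are hypothetical, and relative paths are resolved against the current working directory, so run it from the repo root.

```python
import json
from pathlib import Path


def find_ckpt_paths(node, prefix=""):
    """Yield (json_location, value) for every "ckpt_path" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            where = f"{prefix}.{key}" if prefix else key
            if key == "ckpt_path":
                yield where, value
            else:
                yield from find_ckpt_paths(value, where)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_ckpt_paths(item, f"{prefix}[{i}]")


def missing_checkpoints(config_file="configs/models.json"):
    """Return the ckpt_path entries whose target file does not exist."""
    config = json.loads(Path(config_file).read_text())
    return [
        (where, path)
        for where, path in find_ckpt_paths(config)
        if not (isinstance(path, str) and Path(path).exists())
    ]
```

Calling `missing_checkpoints()` returns an empty list when every configured checkpoint is present.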
Optional, but recommended.
Qwen2-VL is used to caption a custom image in the gradio demo for video generation.
We prefer a quantized version of Qwen2-VL, such as the GPTQ-Int8 or AWQ variants in the official repo, for its speed and lower GPU memory usage.
Download the pre-trained model and put it under the `pretrained_models` folder:
─┬─ pretrained_models/
└─── Qwen2-VL-7B-Instruct-AWQ/
python cami2v_gradio_app.py --use_qwenvl_captioner
Gradio may struggle to establish a network connection; if so, please retry with `--use_host_ip`.
Please follow the instructions in the `datasets` folder of this repo to download the RealEstate10K dataset and pre-process the necessary items, such as `video_clips` and `valid_metadata`.
Download the pretrained weights of the base model DynamiCrafter (256x256, 512x320) and put them under the `pretrained_models` folder:
─┬─ pretrained_models/
 ├─┬─ DynamiCrafter/
 │ └─── model.ckpt
 └─┬─ DynamiCrafter_512/
   └─── model.ckpt
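A quick way to confirm the layout above before training is to check that both weight files exist. The following is a sketch, not part of the repo; `missing_weights` is a hypothetical helper, and the mapping of each folder to a resolution is an assumption inferred from the folder names.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_WEIGHTS = [
    "DynamiCrafter/model.ckpt",      # presumably the 256x256 base model
    "DynamiCrafter_512/model.ckpt",  # presumably the 512x320 base model
]


def missing_weights(root="pretrained_models", expected=EXPECTED_WEIGHTS):
    """Return the expected weight files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_weights()` means both base checkpoints are in place.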
Start training by passing a config YAML file to the `--base` argument of `main/trainer.py`. Example training configs are provided in the `configs` folder.
torchrun --standalone --nproc_per_node 8 main/trainer.py --train \
--logdir $(pwd)/logs \
--base configs/<YOUR_CONFIG_NAME>.yaml \
--name <YOUR_LOG_NAME>
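When launching several runs from a script, the same command can be assembled programmatically. The sketch below only rebuilds the argument list shown above; `build_train_command`, `my_config`, and `my_run` are placeholders rather than names from this repo, and whether the example configs work unchanged with fewer than 8 GPUs is not verified here.

```python
import os
import shlex


def build_train_command(config_name, log_name, num_gpus=8, logdir=None):
    """Assemble the torchrun invocation shown above as an argument list."""
    logdir = logdir or os.path.join(os.getcwd(), "logs")  # mirrors $(pwd)/logs
    return [
        "torchrun", "--standalone", "--nproc_per_node", str(num_gpus),
        "main/trainer.py", "--train",
        "--logdir", logdir,
        "--base", f"configs/{config_name}.yaml",
        "--name", log_name,
    ]


# Placeholder config and run names, for illustration only.
cmd = build_train_command("my_config", "my_run", num_gpus=4, logdir="/tmp/logs")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root.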
We calculate RotErr, TransErr, CamMC and FVD to evaluate camera controllability and visual quality. Code and an installation guide for the requirements, including COLMAP and GLOMAP, are provided in the `evaluation` folder. Support for VBench is also planned in the coming months.
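The metric definitions are not spelled out here, so the sketch below assumes the per-frame formulas commonly used in the CameraCtrl and MotionCtrl papers: RotErr sums the geodesic angle between ground-truth and estimated rotations, TransErr sums the L2 distance between translations, and CamMC sums the L2 distance between the full `[R|T]` matrices. The function names are illustrative. The code in the `evaluation` folder is authoritative and may differ, for example in how poses estimated by COLMAP or GLOMAP are aligned and scaled.

```python
import math


def rot_err(R_gt, R_est):
    """Sum over frames of the rotation angle (radians) between 3x3 rotations."""
    total = 0.0
    for A, B in zip(R_gt, R_est):
        # trace(A @ B^T) equals the sum of element-wise products of A and B.
        trace = sum(A[i][j] * B[i][j] for i in range(3) for j in range(3))
        cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
        total += math.acos(cos_angle)
    return total


def trans_err(T_gt, T_est):
    """Sum over frames of the Euclidean distance between translation vectors."""
    return sum(math.dist(a, b) for a, b in zip(T_gt, T_est))


def cam_mc(R_gt, T_gt, R_est, T_est):
    """Sum over frames of the L2 (Frobenius) distance between [R|T] matrices."""
    total = 0.0
    for A, a, B, b in zip(R_gt, T_gt, R_est, T_est):
        sq = sum((A[i][j] - B[i][j]) ** 2 for i in range(3) for j in range(3))
        sq += sum((x - y) ** 2 for x, y in zip(a, b))
        total += math.sqrt(sq)
    return total
```

Each function takes per-frame lists, so a 16-frame clip would pass 16 rotations and 16 translations for both the ground-truth and the estimated trajectory.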
CameraCtrl: https://github.com/hehao13/CameraCtrl
MotionCtrl: https://github.com/TencentARC/MotionCtrl
DynamiCrafter: https://github.com/Doubiiu/DynamiCrafter
@article{zheng2024cami2v,
title={CamI2V: Camera-Controlled Image-to-Video Diffusion Model},
author={Zheng, Guangcong and Li, Teng and Jiang, Rui and Lu, Yehao and Wu, Tao and Li, Xi},
journal={arXiv preprint arXiv:2410.15957},
year={2024}
}