This project aims to reproduce Sora (OpenAI's T2V model). We hope the open-source community will contribute to this project.

PKU-YuanGroup/Open-Sora-Plan

This project aims to create a simple and scalable repo to reproduce Sora (from OpenAI, though we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system, and models trained on Huawei Ascend can produce video quality comparable to industry standards.

This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the PKU-Rabbitpre AIGC Joint Lab. The current version is still far from the goal and needs continuous improvement and rapid iteration; pull requests are welcome! The code also supports complete training and inference on a domestic AI computing system (Huawei Ascend), and models trained on Ascend can output video quality on par with the industry.

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

📣 News

  • COMING SOON ⚡️⚡️⚡️ For parallel training of large models, TP, SP, and more strategies are coming...

    A Huawei Ascend multimodal MindSpeed-MM branch will be added soon. It will use the MindSpeed-MM suite to support scaling up the parameter count of Open-Sora Plan and provide distributed training capabilities such as TP and SP for training larger models.

  • [2024.12.03] ⚡️ We released our arXiv paper and WF-VAE paper for v1.3. The next, more powerful version is coming soon.

  • [2024.10.16] 🎉 We released version 1.3.0, featuring WF-VAE, a prompt refiner, a data filtering strategy, sparse attention, and a bucket training strategy. We also support 93x480p within 24G VRAM. More details can be found in our latest report.

  • [2024.08.13] 🎉 We are launching the Open-Sora Plan v1.2.0 I2V model, which is based on Open-Sora Plan v1.2.0. The current version supports image-to-video generation and transition generation (video generation conditioned on the starting and ending frames). Check out the Image-to-Video section in this report.

  • [2024.07.24] 🔥🔥🔥 v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p videos. Check out our latest report.

  • [2024.05.27] 🎉 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video for its capability to annotate long videos.

  • [2024.04.09] 🤝 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.

  • [2024.04.07] 🎉🎉🎉 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.

  • [2024.03.27] 🚀🚀🚀 We released the report of VideoCausalVAE, which supports both images and videos. Our reconstructed videos are shown in this demonstration. The text-to-video model is on the way.

  • [2024.03.01] 🤗 We launched a plan to reproduce Sora, called Open-Sora Plan! You are welcome to watch 👀 this repository for the latest updates.

😍 Gallery

Text & Image to Video Generation.

Demo Video of Open-Sora Plan V1.3

😮 Highlights

Open-Sora Plan shows excellent performance in video generation.

🔥 High-performance CausalVideoVAE with lower training cost

  • High compression ratio with excellent performance: videos are compressed by 256 times (4×8×8). Causal convolution supports inference on both images and videos, yet training needs only 1 node.
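
As a rough illustration of the 4×8×8 figure above, the sketch below computes the latent grid for a given clip. It assumes the causal VAE encodes the first frame on its own and then compresses every further group of 4 frames into one latent frame, so the latent length is (frames - 1) / 4 + 1. That formula and the function names are assumptions made for illustration; they are not taken from the repository.

```python
def nominal_compression(t_stride=4, s_stride=8):
    """Nominal position-count compression: time stride x height stride x width stride."""
    return t_stride * s_stride * s_stride


def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Return (latent_frames, latent_height, latent_width) for one clip.

    Assumption: causal temporal compression, i.e. the first frame maps to
    its own latent frame and each further group of `t_stride` frames maps
    to one more latent frame.
    """
    if frames < 1 or (frames - 1) % t_stride != 0:
        raise ValueError(f"frames must be {t_stride}n+1, got {frames}")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError(f"height and width must be multiples of {s_stride}")
    return (frames - 1) // t_stride + 1, height // s_stride, width // s_stride


if __name__ == "__main__":
    print(nominal_compression())
    print(latent_shape(93, 640, 640))
    print(latent_shape(1, 640, 640))  # a single image
```

Under these assumptions a 93x640x640 clip maps to a 24x80x80 latent grid. The 256 figure counts spatiotemporal positions only and does not account for the number of latent channels.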

🚀 Video Diffusion Model based on 3D attention, joint learning of spatiotemporal features.

  • With a new sparse attention architecture instead of a 2+1D model, 3D attention can better capture joint spatial and temporal features.

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo with the following command.

python -m opensora.serve.gradio_web_server --model_path "path/to/model" \
    --ae WFVAEModel_D8_4x8x8 --ae_path "path/to/vae" \
    --caption_refiner "path/to/refiner" \
    --text_encoder_name_1 "path/to/text_enc" --rescale_betas_zero_snr

ComfyUI

Coming soon...

🐳 Resource

Version | Architecture | Diffusion Model | CausalVideoVAE | Data | Prompt Refiner
v1.3.0 [4] | Skiparse 3D | Anysize in 93x640x640[3], Anysize in 93x640x640_i2v[3] | Anysize | prompt_refiner | checkpoint
v1.2.0 | Dense 3D | 93x720p, 29x720p[1], 93x480p[1,2], 29x480p, 1x480p, 93x480p_i2v | Anysize | Annotations | -
v1.1.0 | 2+1D | 221x512x512, 65x512x512 | Anysize | Data and Annotations | -
v1.0.0 | 2+1D | 65x512x512, 65x256x256, 17x256x256 | Anysize | Data and Annotations | -

[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.

[2] We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.

[3] The model is trained on arbitrary resolutions with stride=32, so keep the inference resolution a multiple of 32. The number of frames must be 4n+1, e.g. 93, 77, 61, 45, 29, 1 (image).

[4] Model weights are also available at OpenMind and WiseModel.
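
The constraints in footnote [3] can be checked before launching inference. The sketch below is a hypothetical pre-flight helper (the function names are not part of the repository) that tests whether a requested size satisfies the multiple-of-32 and 4n+1 rules, and rounds a request down to the nearest valid value. It applies only to the models that footnote [3] refers to.

```python
STRIDE = 32      # spatial stride from footnote [3]
FRAME_STEP = 4   # frame counts must have the form 4n+1


def is_valid_request(frames, height, width):
    """True if the size follows footnote [3]: 4n+1 frames, H and W multiples of 32."""
    return (
        frames >= 1
        and (frames - 1) % FRAME_STEP == 0
        and height > 0
        and width > 0
        and height % STRIDE == 0
        and width % STRIDE == 0
    )


def snap_resolution(value):
    """Round a height or width down to the nearest positive multiple of 32."""
    return max(STRIDE, value // STRIDE * STRIDE)


def snap_frames(frames):
    """Round a frame count down to the nearest value of the form 4n+1 (at least 1)."""
    return max(1, (frames - 1) // FRAME_STEP * FRAME_STEP + 1)


if __name__ == "__main__":
    print(is_valid_request(93, 640, 640))
    print(is_valid_request(93, 720, 1280))
    print(snap_frames(96), snap_resolution(720), snap_resolution(1280))
```

Note that 720 is not a multiple of 32, so under this rule a 720-pixel height would need to be rounded to 704.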

Warning

🚨 For version 1.2.0, we no longer support 2+1D models.

⚙️ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  2. Install the required packages. We recommend the following versions:
  • Python >= 3.8
  • PyTorch >= 2.1.0

GPU

conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .

NPU

pip install torch_npu==2.1.0.post6
# ref https://github.com/dmlc/decord
git clone --recursive https://github.com/dmlc/decord
# enter the cloned repository before creating the build directory
cd decord
mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release -DFFMPEG_DIR=/usr/local/ffmpeg
make
cd ../python
pwd=$PWD
echo "PYTHONPATH=$PYTHONPATH:$pwd" >> ~/.bashrc
source ~/.bashrc
python3 setup.py install --user
  3. Install optional requirements, such as static type checking:
pip install -e '.[dev]'

🏗️ Training & Inferencing

🗜️ CausalVideoVAE

The data preparation, training, inference, and evaluation instructions can be found here.

📖 Prompt Refiner

The data preparation, training, and inference instructions can be found here.

📜 Text-to-Video

The data preparation, training, and inference instructions can be found here.

🖼️ Image-to-Video

The data preparation, training, and inference instructions can be found here.

⚡️ Extra Memory Savings

🔆 Training

During training, the entire EMA model remains in VRAM. To avoid this, enable --offload_ema or disable --use_ema. VAE tiling is disabled by default; to save more memory, pass --enable_tiling or disable --vae_fp32. Finally, as a temporary but extreme memory-saving option, enable --extra_save_mem to offload the text encoder and VAE to the CPU when they are not in use, though this significantly slows down training.

We currently have two plans. One is to continue with the DeepSpeed/FSDP approach, sharding the EMA and text encoder across ranks with ZeRO-3, which is sufficient for training 10-15B models. The other is to adopt MindSpeed for various parallel strategies, enabling us to scale the model up to 30B.
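
To give a feel for why the EMA copy matters at these scales, here is a back-of-the-envelope estimate. It assumes the EMA weights are stored in fp32 (4 bytes per parameter) and that sharding splits them evenly across ranks. Both are simplifying assumptions, and the function name is hypothetical. The model weights themselves, optimizer state, and activations are not counted.

```python
GIB = 2 ** 30  # bytes per GiB


def ema_copy_gib(num_params, bytes_per_param=4, num_ranks=1):
    """Approximate per-rank size in GiB of one EMA copy of the weights.

    Assumptions: fp32 storage by default and perfectly even sharding
    across `num_ranks` (num_ranks=1 means the copy is not sharded).
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    return num_params * bytes_per_param / num_ranks / GIB


if __name__ == "__main__":
    for billions in (10, 15, 30):
        params = billions * 1e9
        full = ema_copy_gib(params)
        sharded = ema_copy_gib(params, num_ranks=8)
        print(f"{billions}B params: {full:.1f} GiB unsharded, "
              f"{sharded:.1f} GiB per rank across 8 ranks")
```

Under these assumptions, an unsharded fp32 EMA copy of a 10B-parameter model takes about 37 GiB, which is why keeping it resident on every device becomes impractical as the model grows.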

⚡️ 24G VRAM Inferencing

Please first make sure that you understand how to run inference; refer to the inference instructions in Text-to-Video. Then simply specify --save_memory, and during inference enable_model_cpu_offload(), enable_sequential_cpu_offload(), and vae.vae.enable_tiling() will be activated automatically.
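
As a sketch of what this amounts to, the hypothetical helper below calls the three methods named above on a pipeline-like object. The attribute path `pipeline.vae.vae` and the call order are assumptions based on the names in the paragraph, not the repository's actual code. A stub stands in for a real pipeline so that the example runs without any model weights.

```python
from types import SimpleNamespace


def apply_save_memory(pipeline):
    """Call the three memory-saving hooks named in the text on `pipeline`.

    Hypothetical helper: the real --save_memory code path may differ.
    """
    pipeline.enable_model_cpu_offload()
    pipeline.enable_sequential_cpu_offload()
    pipeline.vae.vae.enable_tiling()
    return pipeline


def make_stub_pipeline(calls):
    """Build a stand-in object that records which hooks were called."""
    inner_vae = SimpleNamespace(
        enable_tiling=lambda: calls.append("vae.vae.enable_tiling")
    )
    return SimpleNamespace(
        enable_model_cpu_offload=lambda: calls.append("enable_model_cpu_offload"),
        enable_sequential_cpu_offload=lambda: calls.append(
            "enable_sequential_cpu_offload"
        ),
        vae=SimpleNamespace(vae=inner_vae),
    )


calls = []
apply_save_memory(make_stub_pipeline(calls))
print(calls)
```

With a real pipeline object, only `apply_save_memory` would be needed; the stub exists purely to show which calls are made.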

💡 How to Contribute

We greatly appreciate your contributions to the Open-Sora Plan open-source community, which help us make it even better than it is now!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement and Related Work

  • Allegro: Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input based on our Open-Sora Plan. The significance of open-source is becoming increasingly tangible.
  • Latte: It is a wonderful 2+1D video generation model.
  • PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
  • ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • DiT: Scalable Diffusion Models with Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

✨ Star History

✏️ Citing

@article{lin2024open,
  title={Open-Sora Plan: Open-Source Large Video Generation Model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}
@article{li2024wf,
  title={WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model},
  author={Li, Zongjian and Lin, Bin and Ye, Yang and Chen, Liuhan and Cheng, Xinhua and Yuan, Shenghai and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17459},
  year={2024}
}

🤝 Community contributors
