Support SFT using ZeRO #654

OleehyO · 2025-01-11T02:18:01Z

No description provided.

When loading videos with fewer frames than max_num_frames, repeat the last frame to reach the required length instead of failing. This ensures consistent tensor dimensions across the dataset while preserving as much original video content as possible.
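The padding rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name `pad_with_last_frame` is hypothetical, frames are modeled as a plain list, and truncating videos that are longer than `max_num_frames` is an assumption the commit message does not state.

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")


def pad_with_last_frame(frames: List[Frame], max_num_frames: int) -> List[Frame]:
    """Return exactly max_num_frames frames, repeating the last frame if needed."""
    if not frames:
        raise ValueError("cannot pad an empty video")
    if len(frames) >= max_num_frames:
        # Assumption: longer videos are simply cut to the required length.
        return frames[:max_num_frames]
    missing = max_num_frames - len(frames)
    # Repeat the final frame so every sample has the same length.
    return frames + [frames[-1]] * missing


short_video = ["f0", "f1", "f2"]
padded = pad_with_last_frame(short_video, 5)
print(padded)
```

Repeating the last frame keeps every original frame in place, which matches the stated goal of preserving as much of the original video as possible.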

- Add SFT (Supervised Fine-Tuning) trainers for all model variants:
  - CogVideoX I2V and T2V
  - CogVideoX-1.5 I2V and T2V
- Add DeepSpeed ZeRO configuration files:
  - ZeRO-2 with and without CPU offload
  - ZeRO-3 with and without CPU offload
- Add base accelerate config for distributed training
- Update trainer.py to support SFT training mode

This enables full-parameter fine-tuning with memory-efficient distributed training using DeepSpeed ZeRO optimization.
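For orientation, the four ZeRO variants listed above differ mainly in the `zero_optimization` section of the DeepSpeed config. The sketch below generates such a config from Python. The helper `build_zero_config` is hypothetical, the chosen values are assumptions rather than the contents of the files added in this PR, and the `"auto"` entries assume the config is consumed through the accelerate integration, which fills them in at launch.

```python
import json


def build_zero_config(stage: int, cpu_offload: bool) -> dict:
    """Build a DeepSpeed config dict for ZeRO-2 or ZeRO-3 (illustrative values)."""
    if stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stages 2 and 3")
    zero = {
        "stage": stage,
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
    if cpu_offload:
        # Optimizer state can be offloaded to CPU in both stages.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        if stage == 3:
            # Parameter offload is only available with ZeRO-3.
            zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "bf16": {"enabled": True},
        "zero_optimization": zero,
        "gradient_accumulation_steps": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_clipping": "auto",
    }


config = build_zero_config(stage=3, cpu_offload=True)
print(json.dumps(config, indent=2))
```

ZeRO-2 shards optimizer state and gradients across ranks, while ZeRO-3 also shards the parameters themselves, which is why parameter offload only appears in the stage 3 branch.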

- Add DeepSpeed ZeRO-3 configuration support
- Optimize memory usage during training
- Rename training scripts to reflect ZeRO usage
- Update related configuration files and trainers

zRzRzRzRzRzRzR · 2025-01-12T11:28:04Z

finetune/README.md

Can we fine-tune on 4 A100 cards, and does each card stay under 40 GB of memory when 8 cards are in use?

- Fix LoRA loading by specifying 'transformer' component
- Swap width/height order in RESOLUTION_MAP to match actual usage

- Remove redundant comments and debug information
- Adjust default parameters in training scripts
- Clean up code in lora_trainer and trainer implementations

zRzRzRzRzRzRzR · 2025-01-19T13:25:27Z

There don't seem to be any major issues at the moment, so let's proceed with the merge first.

OleehyO added 5 commits January 8, 2025 02:14

fix: remove copying first video frame as conditioning image

f6d722c

fix: pad latent frames to match patch_size_t requirements

e213b6c
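The commit title above names `patch_size_t` but gives no detail. As a rough sketch, assuming the requirement is that the number of latent frames be a multiple of `patch_size_t`, the amount of padding can be computed as below. The helper name is hypothetical, and the sketch says nothing about whether the extra frames go at the start or the end of the sequence.

```python
def latent_frames_to_pad(num_latent_frames: int, patch_size_t: int) -> int:
    """Number of extra latent frames needed to reach a multiple of patch_size_t."""
    if num_latent_frames <= 0 or patch_size_t <= 0:
        raise ValueError("both arguments must be positive")
    # Python's modulo of a negative number is non-negative for a positive divisor,
    # so this yields 0 when the count is already divisible.
    return (-num_latent_frames) % patch_size_t


for n in (12, 13):
    print(n, "latent frames need", latent_frames_to_pad(n, 2), "extra")
```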

Merge remote-tracking branch 'upstream/CogVideoX_dev' into dev

2f275e8

zRzRzRzRzRzRzR mentioned this pull request Jan 11, 2025

finetune CogVideoX frames #647

Open

zRzRzRzRzRzRzR and others added 7 commits January 11, 2025 12:53

add comment as #653

7dc8516

feat: support DeepSpeed ZeRO-3 and optimize peak memory usage

fdb9820

Rename lora training scripts as ddp

795dd14

docs: add SFT support documentation in multilingual README

f516938

Add pydantic dependency

3252614

Merge remote-tracking branch 'upstream/main' into dev

30ba108

fix: normalize image tensors in I2VDataset

b362663

zRzRzRzRzRzRzR mentioned this pull request Jan 12, 2025

Work plan and enhancement / 工作计划和用户诉求 #194

Open

OleehyO and others added 3 commits January 12, 2025 08:50

chore: update default training configurations

70c899f

Merge remote-tracking branch 'upstream/CogVideoX_dev' into dev

86a0226

add pipeline

1534bf3

zRzRzRzRzRzRzR reviewed Jan 12, 2025

OleehyO added 3 commits January 13, 2025 10:49

fix: correct LoRA loading and resolution dimensions

4f1cc66

Merge remote-tracking branch 'upstream/CogVideoX_dev' into dev

954ba28

chore: code cleanup and parameter optimization

455b44a

OleehyO force-pushed the CogVideoX_dev branch from f0f8316 to 455b44a January 13, 2025 11:56

zRzRzRzRzRzRzR and others added 7 commits January 13, 2025 20:02

add comment of bash scripts

78275b0

fix: correct do_validation argument parsing

4878edd
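The commit above does not say what was wrong with the parsing. One common pitfall with boolean options, shown here purely as an assumption, is `argparse` with `type=bool`, where any non-empty string (including `"false"`) is truthy. A minimal sketch of an explicit converter follows. The `--do_validation` flag spelling and the helper names are illustrative, not taken from the PR.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert common textual spellings of a boolean into a real bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # type=bool would treat the string "false" as True; use the converter instead.
    parser.add_argument("--do_validation", type=str_to_bool, default=False)
    return parser


args = build_parser().parse_args(["--do_validation", "false"])
print(args.do_validation)
```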

zero_to_bf16

7993670

move to tools

4615479

Merge remote-tracking branch 'upstream/CogVideoX_dev' into dev

0e78f20

deps: upgrade diffusers to >=0.32.1

bf9c351

docs: enhance CLI demo documentation

bf73742

zRzRzRzRzRzRzR approved these changes Jan 19, 2025

zRzRzRzRzRzRzR merged commit c1ca70b into main Jan 20, 2025
