p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, and Limin Wang.


We present p-MoD, a series of efficient MLLMs that feature:

  • ✂️ Mixture-of-Depths mechanism, upgraded with tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
  • 🎢 Progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer.
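
To make the PRD idea concrete, here is a minimal sketch of a per-layer retention schedule and the token budget it implies. The half-cosine shape, the function names `retention_ratios` and `tokens_kept`, and the example numbers are assumptions for illustration only; they are not taken from this repository, so refer to the paper and the code for the schedule actually used.

```python
import math


def retention_ratios(num_layers: int, start: float = 1.0, end: float = 0.0) -> list[float]:
    """Hypothetical decay schedule: the ratio falls from `start` to `end` along a half cosine."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [start]
    ratios = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        weight = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1.0 to 0.0
        ratios.append(end + (start - end) * weight)
    return ratios


def tokens_kept(num_tokens: int, ratios: list[float]) -> list[int]:
    """Number of tokens each layer would process under the schedule (at least one)."""
    return [max(1, round(num_tokens * ratio)) for ratio in ratios]
```

With 32 layers and 576 tokens (both example values), `tokens_kept(576, retention_ratios(32, 1.0, 0.1))` starts at 576 tokens in the first layer and ends at 58 in the last.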

📕 Performance and Efficiency

p-MoD matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.
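
Read as savings, the relative costs above translate as follows. The snippet is plain arithmetic on the three percentages quoted in this section; the dictionary keys are illustrative labels, not names used in the repository.

```python
# Relative cost of p-MoD versus the baseline, as quoted above (percent).
relative_cost = {
    "inference TFLOPs": 55.6,
    "inference KV cache": 53.8,
    "training GPU hours": 77.7,
}

# The saving is the complement of the relative cost.
savings = {name: round(100.0 - cost, 1) for name, cost in relative_cost.items()}

for name, saved in savings.items():
    print(f"{name}: {saved}% saved")
```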


🛠️ Requirements and Installation

  1. Clone this repository and navigate to the folder

     ```bash
     git clone https://github.com/MCG-NJU/p-MoD.git
     cd p-MoD
     ```

  2. Install packages

     ```bash
     conda create -n p-mod python=3.10 -y
     conda activate p-mod
     pip install --upgrade pip  # enable PEP 660 support
     pip install -e .
     pip install -e lmms-eval
     ```

  3. Install additional packages for training

     ```bash
     pip install -e ".[train]"
     pip install flash-attn --no-build-isolation --no-cache-dir
     ```

  4. Log in to Hugging Face and wandb

     ```bash
     huggingface-cli login
     wandb login
     ```
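
After installation, a quick sanity check can confirm that the expected packages are importable in the `p-mod` environment. This is a minimal sketch: the helper `missing_modules` is hypothetical, and the import names in `EXPECTED` are assumptions, since a pip package name and its import name can differ.

```python
import importlib.util


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed above; adjust as needed.
EXPECTED = ["torch", "flash_attn", "lmms_eval", "wandb"]

if __name__ == "__main__":
    absent = missing_modules(EXPECTED)
    print("missing:", ", ".join(absent) if absent else "none")
```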

🐯 Model Zoo

| Model | LLM | Epoch | Pretrain Data | SFT Data |
| --- | --- | --- | --- | --- |
| p-MoD-LLaVA-NeXT-7B | Vicuna-7B | 1 | 558K | 779K |
| p-MoD-LLaVA-v1.5-7B | Vicuna-7B | 1 | 558K | 665K |

📊 Evaluation

We evaluate our model using lmms-eval. You can use our script ./scripts/lmms-eval/eval.sh, for example:

```bash
bash ./scripts/lmms-eval/eval.sh \
  --ckpt MCG-NJU/p-MoD-LLaVA-NeXT-7B \
  --eval_tasks ai2d,chartqa \
  --project_name pmod \
  --run_name pmod-llava-next-7b-ft
```
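
When evaluating several checkpoints or task sets, the same invocation can be assembled from Python. The sketch below uses only the four flags shown in the example above; the helper name `build_eval_command` is hypothetical, and any other options `eval.sh` may accept are not covered here.

```python
import shlex


def build_eval_command(ckpt: str, tasks: list[str], project_name: str, run_name: str) -> list[str]:
    """Assemble the argv for eval.sh using only the flags shown in the example above."""
    if not tasks:
        raise ValueError("at least one task is required")
    return [
        "bash", "./scripts/lmms-eval/eval.sh",
        "--ckpt", ckpt,
        "--eval_tasks", ",".join(tasks),  # tasks are passed as one comma-separated value
        "--project_name", project_name,
        "--run_name", run_name,
    ]


cmd = build_eval_command(
    ckpt="MCG-NJU/p-MoD-LLaVA-NeXT-7B",
    tasks=["ai2d", "chartqa"],
    project_name="pmod",
    run_name="pmod-llava-next-7b-ft",
)
print(shlex.join(cmd))  # shows the equivalent shell command
```

To launch the run rather than print it, pass `cmd` to `subprocess.run(cmd, check=True)` from the repository root.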

🚀 Train

Pretraining

We use the pretrained MLP projector provided by LLaVA. Download it and place the model weights under ./checkpoints/llava-v1.5-7b-pretrain/llava-official-checkpoint.

p-MoD-LLaVA-NeXT

First, prepare the data with our Python script ./util_scripts/download_llava-next_data.py. It downloads the 779K LLaVA-NeXT data, saves the images under ./playground/data/llava_next_images/, and writes the annotation JSON to ./playground/data/llava_next_data.json.
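
Before starting a long training run, it can help to confirm that every image referenced in the annotation JSON is actually on disk. The sketch below assumes the JSON is a list of records and that each record stores its relative image path under an `image` key (the usual LLaVA-style layout); both points are assumptions, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(records: list[dict], image_root: Path) -> list[str]:
    """Return the image paths referenced by `records` that are absent under `image_root`."""
    missing = []
    for record in records:
        image = record.get("image")  # assumed key; text-only records may not have it
        if isinstance(image, str) and not (image_root / image).is_file():
            missing.append(image)
    return missing


def check_dataset(json_path: Path, image_root: Path) -> list[str]:
    """Load the annotation JSON (assumed to be a list of records) and check its images."""
    with json_path.open(encoding="utf-8") as handle:
        records = json.load(handle)
    return find_missing_images(records, image_root)
```

For the paths above, the call would be `check_dataset(Path("./playground/data/llava_next_data.json"), Path("./playground/data/llava_next_images"))`; an empty list means no referenced image is missing.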

Then you can start training using ./scripts/train/finetune_eval_7b_pmod_llava_next.sh.

p-MoD-LLaVA-1.5

First, prepare the instruction tuning data following LLaVA-1.5: download the images from the constituent datasets and the annotation JSON llava_v1_5_mix665k.json, then save the images and the JSON under ./playground/data.

Then fix the broken examples in the data JSON by running:

```bash
python util_scripts/clean_data_json.py \
  --original_json_path ./playground/data/llava_v1_5_mix665k.json \
  --cleaned_json_path ./playground/data/llava_v1_5_mix665k_cleaned.json
```
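
To see what the cleaning step changed, the original and cleaned files can be compared afterwards. This sketch does not reproduce `clean_data_json.py`, and it makes no claim about which examples that script treats as broken; it only assumes both files hold a JSON list of records. The helper names are hypothetical.

```python
import json
from pathlib import Path


def load_records(path: Path) -> list:
    """Load a JSON file that is expected to contain a list of records."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON list")
    return data


def summarize_cleaning(original: list, cleaned: list) -> dict:
    """Count the records in each file and how many cleaned records match an original one."""
    # Serialize each record so that dicts can be compared as hashable strings.
    original_keys = {json.dumps(record, sort_keys=True) for record in original}
    unchanged = sum(
        1 for record in cleaned if json.dumps(record, sort_keys=True) in original_keys
    )
    return {
        "original": len(original),
        "cleaned": len(cleaned),
        "unchanged": unchanged,
        "new_or_modified": len(cleaned) - unchanged,
    }
```

Calling `summarize_cleaning(load_records(original_path), load_records(cleaned_path))` with the two paths from the command above reports how many records were dropped or rewritten.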

Start training with ./scripts/train/finetune_eval_7b_pmod_llava_1_5.sh.

📄 Citation

If you find our work helpful for your research and applications, please cite our paper:

```bibtex
@article{zhang2024pmod,
  title={p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay},
  author={Zhang, Jun and Meng, Desen and Qi, Ji and Huang, Zhenpeng and Wu, Tao and Wang, Limin},
  journal={arXiv preprint arXiv:2412.04449},
  year={2024}
}
```

💫 Acknowledgement