Tian Liu¹ · Huixin Zhang¹ · Shubham Parashar¹ · Shu Kong²
¹Texas A&M University ²University of Macau
Our work adapts a pretrained Vision-Language Model (VLM) and retrieves relevant pretraining images to solve the few-shot recognition problem.
To mitigate the domain gap and imbalanced distribution problems of the retrieved data, we propose a novel Stage-Wise retrieval-Augmented fineTuning (SWAT) method, which outperforms previous few-shot recognition methods by >6% in accuracy across nine benchmark datasets.
- 2025-01-18: We provide access to our retrieved data through URLs. See RETRIEVAL.md.
- 2024-11-24: Updated code base to include more datasets.
- 2024-08-22: Retrieval code released, see RETRIEVAL.md.
- 2024-07-05: SWAT finetuning code released.
- 2024-06-28: Project page launched.
- 2024-06-17: arXiv paper released.
Create a conda environment and install dependencies following the instructions in ENV.md.
Prepare the datasets following the instructions in DATASETS.md.
Retrieve relevant pretraining data following the instructions in RETRIEVAL.md.
You can run SWAT and few-shot finetuning using the following bash scripts.
# 1. check the options in run_dataset_seed_xxx.sh,
# this can be used to run a batch of experiments.
# 2. run the corresponding bash script in command line
# Usage: bash scripts/run_dataset_seed_xxx.sh <dataset> [seed]
# finetune on few-shot, seed 1
bash scripts/run_dataset_seed_finetune_fewshot.sh semi-aves 1
# finetune on few-shot with CutMix, 3 seeds
bash scripts/run_dataset_seed_finetune_fewshot_cutmix.sh semi-aves
# swat
bash scripts/run_dataset_seed_SWAT.sh semi-aves 1
The results of the experiments will be saved in the `result` directory. Detailed logs, models, and scores will be saved in the `output` directory.
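To queue the same experiment over several seeds, a small dry-run helper can print one command per seed before anything is launched. This is a minimal sketch: the script path and the `<dataset> [seed]` argument order come from the usage line above, while the function name `print_swat_commands` and the seed values 1, 2, 3 are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Dry-run helper: print one SWAT command per seed for a given dataset.
# The function name and the seed list are illustrative assumptions;
# only the script path and argument order are taken from the usage line.
print_swat_commands() {
    local dataset="$1"
    shift
    local seed
    for seed in "$@"; do
        echo "bash scripts/run_dataset_seed_SWAT.sh ${dataset} ${seed}"
    done
}

# Print the commands for three seeds on semi-aves without running them.
print_swat_commands semi-aves 1 2 3
```

After checking the printed commands, they can be executed from the repository root by piping the output to `bash`.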
Below we provide the commands to run the zero-shot and few-shot baselines in the paper. Update the `model_cfg` option in the bash scripts to use different models.
Zero-shot methods:
# OpenCLIP zero-shot
bash scripts/run_dataset_zeroshot.sh semi-aves
# REAL-Prompt
bash scripts/run_dataset_REAL-Prompt.sh semi-aves
# REAL-Linear
# take the WSFT accuracy with alpha=0.5
# find the line: `Alpha:0.5, Val Acc: 48.671, Test Acc: 48.562`
bash scripts/run_dataset_REAL-Linear.sh semi-aves
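Picking the WSFT accuracy out of a long log by eye is error-prone, so the lookup can be scripted. The sketch below assumes the log contains lines shaped exactly like the sample quoted above (`Alpha:0.5, Val Acc: ..., Test Acc: ...`); the function name `extract_test_acc` is hypothetical, and the location of the log file is not specified here, so the demo writes the sample line to a temporary file instead.

```shell
#!/usr/bin/env bash
# Print the test accuracy reported for a given alpha in a log file.
# Assumes lines shaped like: Alpha:0.5, Val Acc: 48.671, Test Acc: 48.562
extract_test_acc() {
    local alpha="$1" logfile="$2"
    # Select lines for the requested alpha, keep only the number after
    # "Test Acc:", and report the last match if the line appears repeatedly.
    grep -F "Alpha:${alpha}," "$logfile" \
        | sed -E 's/.*Test Acc: *([0-9.]+).*/\1/' \
        | tail -n 1
}

# Demo on a temporary file holding the sample line quoted above.
demo_log="$(mktemp)"
echo 'Alpha:0.5, Val Acc: 48.671, Test Acc: 48.562' > "$demo_log"
extract_test_acc 0.5 "$demo_log"   # prints 48.562
rm -f "$demo_log"
```

In practice, replace the temporary file with the path to the log written by the REAL-Linear run.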
Few-shot methods:
# Cross-modal Linear Probing (CMLP)
bash scripts/run_dataset_seed_CMLP.sh semi-aves 1
For CLAP, we use the provided code but replace the CLIP model with OpenCLIP. Our implementation, with instructions, can be found in CLAP-tian.
This code base was developed with reference to the following projects. We sincerely thank the authors for open-sourcing their work.
If you find our project useful, please consider citing:
@article{liu2024few,
title={Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning},
author={Liu, Tian and Zhang, Huixin and Parashar, Shubham and Kong, Shu},
journal={arXiv preprint arXiv:2406.11148},
year={2024}
}
@inproceedings{parashar2024neglected,
title={The Neglected Tails in Vision-Language Models},
author={Parashar, Shubham and Lin, Zhiqiu and Liu, Tian and Dong, Xiangjue and Li, Yanan and Ramanan, Deva and Caverlee, James and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}