This software project accompanies the research paper, AutoFocusFormer: Image Segmentation off the Grid (CVPR 2023).
Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alex Schwing, Alex Colburn, Li Fuxin
arXiv | video narration | AFF-Classification (this repo) | AFF-Segmentation
AutoFocusFormer (AFF) is the first adaptive-downsampling network capable of dense prediction tasks such as semantic/instance segmentation.
AFF abandons the traditional grid structure of image feature maps, and automatically learns to retain the most important pixels with respect to the task goal.
AFF consists of a local-attention transformer backbone and a task-specific head. The backbone consists of four stages, each stage containing three modules: balanced clustering, local-attention transformer blocks, and adaptive downsampling.
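The idea of retaining the most important pixels can be illustrated with a minimal sketch. This is not the code from this repository: the function `adaptive_downsample`, the `Token` tuple, and the hand-made scores are all assumptions for illustration. In AFF the importance scores are learned, and the backbone also performs balanced clustering and local attention, which this toy omits.

```python
import math
from typing import List, Tuple

Token = Tuple[int, int, float]  # (x, y, importance score); hypothetical layout


def adaptive_downsample(tokens: List[Token], keep_ratio: float) -> List[Token]:
    """Keep the highest-scoring fraction of tokens, wherever they sit."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank by score only; positions play no role, so the result is off-grid.
    ranked = sorted(tokens, key=lambda t: t[2], reverse=True)
    return ranked[:n_keep]


# Toy 5x4 "image": scores are hand-made and peak at a salient point (4, 0).
# In AFF the scores are learned; this stand-in only illustrates the selection.
tokens = [
    (x, y, 1.0 / (1.0 + math.hypot(x - 4, y)))
    for y in range(4)
    for x in range(5)
]
kept = adaptive_downsample(tokens, keep_ratio=1 / 5)
print(len(tokens), "->", len(kept))
print(sorted((x, y) for x, y, _ in kept))
```

The surviving tokens cluster around the salient point instead of being spaced evenly, which is the contrast with fixed grid pooling that the description above draws.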
AFF demonstrates significant savings on FLOPs (see our models with 1/5 downsampling rate), and significant improvement on recognition of small objects.
Notably, AFF-Small achieves 44.0 instance segmentation AP and 66.9 panoptic segmentation PQ on Cityscapes val with a backbone of only 42.6M parameters, a performance on par with Swin-Large, a backbone with 197M params (saving 78%!).
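The 78% figure follows directly from the two parameter counts quoted above. A short check, with variable names chosen here for illustration:

```python
# Backbone sizes in millions of parameters, as quoted in the text above.
aff_small_params_m = 42.6
swin_large_params_m = 197.0

# Fraction of parameters saved by using AFF-Small instead of Swin-Large.
saving = 1 - aff_small_params_m / swin_large_params_m
print(f"parameter saving: {saving:.1%}")
```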
name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 1K model |
---|---|---|---|---|---|---|---|---|
AFF-Mini | ImageNet-1K | 224x224 | 78.2 | 93.6 | 6.75M | 1.08G | 1337 | Apple ML |
AFF-Mini-1/5 | ImageNet-1K | 224x224 | 77.5 | 93.3 | 6.75M | 0.72G | 1678 | Apple ML |
AFF-Tiny | ImageNet-1K | 224x224 | 83.0 | 96.3 | 27M | 4G | 528 | Apple ML |
AFF-Tiny-1/5 | ImageNet-1K | 224x224 | 82.4 | 95.9 | 27M | 2.74G | 682 | Apple ML |
AFF-Small | ImageNet-1K | 224x224 | 83.5 | 96.6 | 42.6M | 8.16G | 321 | Apple ML |
AFF-Small-1/5 | ImageNet-1K | 224x224 | 83.4 | 96.5 | 42.6M | 5.69G | 424 | Apple ML |
FPS is obtained on a single V100 GPU.
We train with a total batch size of 4096.
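To make the trade-off of the 1/5 downsampling rate concrete, the sketch below compares each model in the ImageNet-1K table above with its 1/5 variant. The numbers are copied from that table; the `ROWS` dictionary and the `compare` helper are names introduced here for illustration only.

```python
# name -> (acc@1, FLOPs in G, FPS), copied from the ImageNet-1K table above
ROWS = {
    "AFF-Mini": (78.2, 1.08, 1337),
    "AFF-Mini-1/5": (77.5, 0.72, 1678),
    "AFF-Tiny": (83.0, 4.0, 528),
    "AFF-Tiny-1/5": (82.4, 2.74, 682),
    "AFF-Small": (83.5, 8.16, 321),
    "AFF-Small-1/5": (83.4, 5.69, 424),
}


def compare(base: str) -> dict:
    """Compare a model with its 1/5-downsampling variant."""
    acc, flops, fps = ROWS[base]
    acc5, flops5, fps5 = ROWS[base + "-1/5"]
    return {
        "flops_saved_pct": 100 * (1 - flops5 / flops),
        "fps_gain_pct": 100 * (fps5 / fps - 1),
        "acc1_drop": acc - acc5,
    }


for name in ("AFF-Mini", "AFF-Tiny", "AFF-Small"):
    r = compare(name)
    print(
        f"{name}: {r['flops_saved_pct']:.1f}% fewer FLOPs, "
        f"{r['fps_gain_pct']:.1f}% higher FPS, "
        f"{r['acc1_drop']:.1f} lower acc@1"
    )
```

For all three sizes the 1/5 variant uses roughly 30% to 33% fewer FLOPs while losing less than one point of top-1 accuracy.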
name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | 22K model | 1K model |
---|---|---|---|---|---|---|---|---|
AFF-Base | ImageNet-22K | 384x384 | 86.2 | 98.0 | 75.34M | 42.54G | Apple ML | Apple ML |
git clone git@github.com:apple/ml-autofocusformer.git
cd ml-autofocusformer
One can download pre-trained checkpoints through the links in the table above.
sh create_env.sh
See further documentation inside the script file.
Our experiments are run with CUDA==11.6 and pytorch==1.12.
We use the standard ImageNet dataset, which can be downloaded from http://image-net.org/.
For the standard folder dataset, move the validation images into labeled sub-folders. The file structure should look like:
$ tree imagenet
imagenet/
├── training
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── validation
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
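Moving validation images into labeled sub-folders and checking the result can be scripted. The sketch below is not part of this repository: `sort_validation_images`, `check_layout`, and the `labels` mapping (file name to class folder name) are hypothetical, and the mapping itself has to be built from whatever label file ships with your copy of the dataset.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def sort_validation_images(val_dir: Path, labels: Dict[str, str]) -> None:
    """Move each flat validation image into a sub-folder named after its class."""
    for filename, class_name in labels.items():
        src = val_dir / filename
        if not src.is_file():
            continue  # already moved, or not present in this copy
        class_dir = val_dir / class_name
        class_dir.mkdir(exist_ok=True)
        shutil.move(str(src), str(class_dir / filename))


def check_layout(root: Path) -> List[str]:
    """Return a list of problems found in the training/validation layout."""
    problems: List[str] = []
    class_sets = {}
    for split in ("training", "validation"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing folder: {split}")
            continue
        class_sets[split] = {p.name for p in split_dir.iterdir() if p.is_dir()}
        stray = [p.name for p in split_dir.iterdir() if p.is_file()]
        if stray:
            problems.append(f"{split}: {len(stray)} file(s) not inside a class folder")
    if len(class_sets) == 2 and class_sets["training"] != class_sets["validation"]:
        problems.append("training and validation class folders differ")
    return problems


# Demo on a throwaway directory with one training image and one flat validation image.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "imagenet"
    (root / "training" / "class1").mkdir(parents=True)
    (root / "training" / "class1" / "img1.jpeg").touch()
    (root / "validation").mkdir()
    (root / "validation" / "img4.jpeg").touch()
    print(check_layout(root))  # reports the stray validation file
    sort_validation_images(root / "validation", {"img4.jpeg": "class1"})
    print(check_layout(root))  # empty list once the layout matches
```

Running `check_layout` on the real dataset root before training is a cheap way to catch a validation folder that is still flat.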
Modify the arguments in the script run_aff.sh (e.g., path to dataset) and run
sh run_aff.sh
for training or evaluation.
Run python main.py -h to see full documentation of the args.
One can also directly modify the config files in configs/.
@inproceedings{autofocusformer,
title = {AutoFocusFormer: Image Segmentation off the Grid},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
author = {Ziwen, Chen and Patnaik, Kaushik and Zhai, Shuangfei and Wan, Alvin and Ren, Zhile and Schwing, Alex and Colburn, Alex and Fuxin, Li},
year = {2023},
}