upload ckpt

RenShuhuai-Andy · Nov 6, 2023 · 4581e43
1 parent 91520c9
Showing 1 changed file with 41 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -13,6 +13,8 @@
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/testa-temporal-spatial-token-aggregation-for/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=testa-temporal-spatial-token-aggregation-for)
 
 # :rocket: News
+* **(Nov 11, 2023)** 
+  * Upload 32-frame fine-tuned checkpoints for paragraph-to-video retrieval.
 * **(Oct 29, 2023)** 
   * Code for video pre-training, video QA, and video-paragraph retrieval.
   * Checkpoints of pre-trained TESTA-base model.
@@ -50,15 +52,35 @@ Please follow the instructions at [DATASETS.md](docs/DATASETS.md) to prepare all
 
 ### Pre-trained model
 
-zero-shot performance (32 frames):
+zero-shot performance on paragraph-to-video retrieval:
 
-| Model                 | QuerYD R@1 | DiDeMo R@1 | ActivityNet Caption R@1 | GFLOPs | Checkpoint                                                                                             |
-|-----------------------|------------|------------|-------------------------|--------|--------------------------------------------------------------------------------------------------------|
-| TESTA-base (ViT-B/16) | 64.4       | 64.9       | 37.1                    | 786    | [testa_model_base_pretrain.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_pretrain/tree/main) |
+| Model                 | frames | QuerYD R@1 | DiDeMo R@1 | ActivityNet Caption R@1 | GFLOPs | Checkpoint                                                                                             |
+|-----------------------|--------|------------|------------|-------------------------|--------|--------------------------------------------------------------------------------------------------------|
+| TESTA-base (ViT-B/16) | 32     | 64.4       | 64.9       | 37.1                    | 786    | [testa_model_base_pretrain.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_pretrain/tree/main) |
 
 ### Fine-tuned model
 
-To be uploaded...
+#### QuerYD paragraph-to-video retrieval
+| Model                 | frames | R@1  | R@5  | R@10 | GFLOPs | Checkpoint                                                                                                                |
+|-----------------------|--------|------|------|------|--------|---------------------------------------------------------------------------------------------------------------------------|
+| TESTA-base (ViT-B/16) | 32     | 77.0 | 90.8 | 92.6 | 420    | [testa_model_base_queryd_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_QuerYD_retrieval_ft/tree/main) |
+
+#### ActivityNet paragraph-to-video retrieval
+| Model                 | frames | R@1  | R@5  | R@10 | GFLOPs | Checkpoint                                                                                                                   |
+|-----------------------|--------|------|------|------|--------|------------------------------------------------------------------------------------------------------------------------------|
+| TESTA-base (ViT-B/16) | 32     | 51.6 | 79.1 | 88.3 | 420    | [testa_model_base_anet_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_ActivityNet_retrieval_ft/tree/main) |
+
+#### DiDeMo paragraph-to-video retrieval
+| Model                 | frames | R@1  | R@5  | R@10 | GFLOPs | Checkpoint                                                                                                                |
+|-----------------------|--------|------|------|------|--------|---------------------------------------------------------------------------------------------------------------------------|
+| TESTA-base (ViT-B/16) | 32     | 57.7 | 83.3 | 89.4 | 420    | [testa_model_base_didemo_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_DiDeMo_retrieval_ft/tree/main) |
+
+
+#### CondensedMovie paragraph-to-video retrieval
+| Model                 | frames | R@1  | R@5  | R@10 | GFLOPs | Checkpoint                                                                                                                     |
+|-----------------------|--------|------|------|------|--------|--------------------------------------------------------------------------------------------------------------------------------|
+| TESTA-base (ViT-B/16) | 32     | 21.5 | 42.4 | 50.7 | 420    | [testa_model_base_cm_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_CondensedMovies_retrieval_ft/tree/main) |
+
 
 ## Training and Evaluation
 Please refer to the [RUN.md](docs/RUN.md) for detailed instructions on training, evaluating and reproducing the results.
@@ -68,5 +90,19 @@ Please refer to the [RUN.md](docs/RUN.md) for detailed instructions on training,
 - [ ] Add visualization code
 - [ ] Add demos
 
+## Contact
+If you have any questions, please feel free to create an issue on this repository.
+
+## Citation
+If you find this code useful for your research, please consider citing:
+```
+@inproceedings{Ren2023TESTA,
+  title={TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding},
+  author={Shuhuai Ren and Sishuo Chen and Shicheng Li and Xu Sun and Lu Hou},
+  year={2023},
+  journal={arXiv preprint arXiv:2310.19060},
+}
+```
+
 ## Acknowledgement
 The codebase relies on resources from [BLIP](https://github.com/salesforce/BLIP), [ToMe](https://github.com/facebookresearch/ToMe), and [TimeSformer](https://github.com/facebookresearch/TimeSformer). We thank the original authors for open-sourcing their work.
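
The checkpoint tables in this diff report R@1, R@5, and R@10 for paragraph-to-video retrieval. As a reading aid, the sketch below shows how Recall@K is commonly computed from a paragraph-by-video similarity matrix. It is not the repository's evaluation code (see `docs/RUN.md` for that); the function name `recall_at_k`, the toy matrix, and the assumption that paragraph `i` is paired with video `i` are illustrative only.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of paragraph queries whose paired video ranks in the top k.

    Assumption (hypothetical setup): sim[i][j] is the similarity between
    paragraph i and video j, and the correct video for paragraph i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy 3x3 similarity matrix (made-up numbers, not model outputs).
toy_sim = [
    [0.9, 0.1, 0.3],  # paragraph 0: correct video is ranked first
    [0.8, 0.5, 0.6],  # paragraph 1: correct video is ranked third
    [0.2, 0.7, 0.4],  # paragraph 2: correct video is ranked second
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(toy_sim, k):.1f}")
```

On the toy matrix, one of three paragraphs retrieves its video at rank 1, two of three within the top 2, and all three within the top 3. Ties are resolved in favor of the correct video here; a real evaluation script may handle ties differently.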