Commit

upload ckpt
RenShuhuai-Andy committed Nov 6, 2023
1 parent 91520c9 commit 4581e43
Showing 1 changed file with 41 additions and 5 deletions.
46 changes: 41 additions & 5 deletions README.md
@@ -13,6 +13,8 @@
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/testa-temporal-spatial-token-aggregation-for/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=testa-temporal-spatial-token-aggregation-for)

# :rocket: News
* **(Nov 11, 2023)**
    * Uploaded 32-frame fine-tuned checkpoints for paragraph-video retrieval.
* **(Oct 29, 2023)**
    * Code for video pre-training, video QA, and video-paragraph retrieval.
    * Checkpoints of the pre-trained TESTA-base model.
@@ -50,15 +52,35 @@ Please follow the instructions at [DATASETS.md](docs/DATASETS.md) to prepare all

### Pre-trained model

zero-shot performance (32 frames):
zero-shot performance on paragraph-to-video retrieval:

| Model | QuerYD R@1 | DiDeMo R@1 | ActivityNet Caption R@1 | GFLOPs | Checkpoint |
|-----------------------|------------|------------|-------------------------|--------|--------------------------------------------------------------------------------------------------------|
| TESTA-base (ViT-B/16) | 64.4 | 64.9 | 37.1 | 786 | [testa_model_base_pretrain.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_pretrain/tree/main) |
| Model | frames | QuerYD R@1 | DiDeMo R@1 | ActivityNet Caption R@1 | GFLOPs | Checkpoint |
|-----------------------|--------|------------|------------|-------------------------|--------|--------------------------------------------------------------------------------------------------------|
| TESTA-base (ViT-B/16) | 32 | 64.4 | 64.9 | 37.1 | 786 | [testa_model_base_pretrain.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_pretrain/tree/main) |

### Fine-tuned model

To be uploaded...
#### QuerYD paragraph-to-video retrieval
| Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|-----------------------|--------|------|------|------|--------|---------------------------------------------------------------------------------------------------------------------------|
| TESTA-base (ViT-B/16) | 32 | 77.0 | 90.8 | 92.6 | 420 | [testa_model_base_queryd_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_QuerYD_retrieval_ft/tree/main) |

#### ActivityNet paragraph-to-video retrieval
| Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|-----------------------|--------|------|------|------|--------|------------------------------------------------------------------------------------------------------------------------------|
| TESTA-base (ViT-B/16) | 32 | 51.6 | 79.1 | 88.3 | 420 | [testa_model_base_anet_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_ActivityNet_retrieval_ft/tree/main) |

#### DiDeMo paragraph-to-video retrieval
| Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|-----------------------|--------|------|------|------|--------|---------------------------------------------------------------------------------------------------------------------------|
| TESTA-base (ViT-B/16) | 32 | 57.7 | 83.3 | 89.4 | 420 | [testa_model_base_didemo_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_DiDeMo_retrieval_ft/tree/main) |


#### CondensedMovie paragraph-to-video retrieval
| Model | frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|-----------------------|--------|------|------|------|--------|--------------------------------------------------------------------------------------------------------------------------------|
| TESTA-base (ViT-B/16) | 32 | 21.5 | 42.4 | 50.7 | 420 | [testa_model_base_cm_f32_f1p12.pth](https://huggingface.co/ShuhuaiRen/TESTA_model_base_CondensedMovies_retrieval_ft/tree/main) |

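The R@1, R@5, and R@10 columns in the tables above report Recall@K: the percentage of query paragraphs whose ground-truth video is ranked within the top K retrieved videos. The snippet below is a minimal sketch of that metric, not the repository's evaluation code. It assumes a paragraph-by-video similarity matrix in which paragraph `i` matches video `i`; the function name `recall_at_k` and the toy scores are hypothetical and used only for illustration.

```python
def recall_at_k(similarity, ks=(1, 5, 10)):
    """Return {k: percentage of queries whose ground-truth video is in the top k}.

    similarity[i][j] is the score between paragraph i and video j.
    Assumption: the ground-truth video for paragraph i is video i.
    """
    num_queries = len(similarity)
    hits = {k: 0 for k in ks}
    for i, row in enumerate(similarity):
        # 0-based rank of the ground-truth video:
        # the number of videos that score strictly higher than it.
        rank = sum(1 for score in row if score > row[i])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / num_queries for k in ks}


# Toy example with made-up scores: 3 paragraphs x 3 videos.
toy = [
    [0.9, 0.1, 0.3],  # ground-truth video ranked 1st
    [0.8, 0.5, 0.2],  # ground-truth video ranked 2nd
    [0.1, 0.2, 0.7],  # ground-truth video ranked 1st
]
scores = recall_at_k(toy, ks=(1, 2))
```

In the toy example, two of the three paragraphs retrieve their video at rank 1 and all three retrieve it within the top 2, so R@1 is about 66.7 and R@2 is 100.
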

## Training and Evaluation
Please refer to the [RUN.md](docs/RUN.md) for detailed instructions on training, evaluating and reproducing the results.
@@ -68,5 +90,19 @@ Please refer to the [RUN.md](docs/RUN.md) for detailed instructions on training,
- [ ] Add visualization code
- [ ] Add demos

## Contact
If you have any questions, please feel free to create an issue on this repository.

## Citation
If you find this code useful for your research, please consider citing:
```
@article{Ren2023TESTA,
title={TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding},
author={Shuhuai Ren and Sishuo Chen and Shicheng Li and Xu Sun and Lu Hou},
year={2023},
journal={arXiv preprint arXiv:2310.19060},
}
```

## Acknowledgement
The codebase relies on resources from [BLIP](https://github.com/salesforce/BLIP), [ToMe](https://github.com/facebookresearch/ToMe), and [TimeSformer](https://github.com/facebookresearch/TimeSformer). We thank the original authors for open-sourcing their work.
