Release of PnP-VQA (Anthony TMH et al., EMNLP Findings 2022).

* add cache to gitignore
* pnp-vqa-okvqa
* rename readme
* add extra config and misc change
* edit model path
* add gqa dataset
* add gqa inference for generate
* pnp-vqa misc change
* compute gradcam in batch
* allow straightforward model.predict_answers()
* Created using Colaboratory
* pnp_vqa
* add test and update colab
* add vqav2 test config and sh
* log vqa result
* fix dict key
* reduce memory by offload model for pnpvqa3b
* misc
* update readme

Commit c2cfb00 (parent 09e636c): 59 changed files with 2,729 additions and 35 deletions.
**`.gitignore`**

```diff
@@ -152,3 +152,5 @@ debug*/
 *.dat
 *.tsv
 *.gz
+
+cache/
```
**New file: GQA dataset card**

![From https://arxiv.org/abs/1902.09506.](imgs/gqa.png)

# GQA Dataset

## Description
(from https://cs.stanford.edu/people/dorarad/gqa/about.html)

GQA is a VQA dataset for real-world images that requires visual, spatial, and compositional reasoning. It consists of 22M questions over 110K images.

## Task
(from https://arxiv.org/abs/1902.09506)

Given an image and a question, the model is required to output a correct answer. GQA questions require spatial understanding, multiple reasoning skills, and multi-step inference.

## Metrics

The metrics are accuracy, consistency, validity, and plausibility. The most commonly reported metric is accuracy.
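Accuracy on the balanced splits is plain exact match between predicted and gold answers. A minimal sketch of that headline metric (the dict formats and lowercasing normalization here are illustrative assumptions; consistency, validity, and plausibility come from the official GQA evaluation script, not this snippet):

```python
def gqa_accuracy(predictions: dict, annotations: dict) -> float:
    """Exact-match accuracy.

    predictions: {question_id: predicted answer string}
    annotations: {question_id: gold answer string}
    """
    correct = sum(
        predictions.get(qid, "").strip().lower() == gold.strip().lower()
        for qid, gold in annotations.items()
    )
    return correct / len(annotations)
```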
## Leaderboard

TBD

## Auto-Downloading

```
cd lavis/datasets/download_scripts && python download_gqa.py
```
## References
"GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", Drew A. Hudson, Christopher D. Manning, CVPR 2019.
**New file (binary image; preview not available)**
**New file: GQA dataset config (testdev_balanced as the val split)**

```yaml
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

datasets:
  gqa:
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url:
            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/train_balanced_questions.json
          storage:
            - gqa/annotations/train_balanced_questions.json
        val:
          url:
            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/testdev_balanced_questions.json
          storage:
            - gqa/annotations/testdev_balanced_questions.json
        test:
          url:
            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/test_balanced_questions.json
          storage:
            - gqa/annotations/test_balanced_questions.json
      images:
        storage: gqa/images/
```
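With the annotations downloaded to the storage paths above (relative to the LAVIS cache root) and the images in place, the dataset should load through LAVIS's builder interface. A minimal sketch, assuming the builder is registered under the name `gqa` and that samples follow the usual LAVIS VQA field names:

```python
from lavis.datasets.builders import load_dataset

# Builds the splits declared under build_info above; images are expected
# under <cache_root>/gqa/images/ (see the download script in the dataset card).
gqa = load_dataset("gqa")
print(gqa.keys())  # expected: train / val / test splits

sample = gqa["train"][0]
print(sample["text_input"])  # the question ("text_input" is the usual LAVIS field name)
```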
**New file: GQA dataset config (val_balanced as the val split)**

```yaml
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

datasets:
  gqa:
    # data_dir: ${env.data_dir}/datasets
    data_type: images # [images|videos|features]

    build_info:
      # Be careful not to append minus sign (-) before split to avoid itemizing
      annotations:
        train:
          url:
            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/train_balanced_questions.json
          storage:
            - gqa/annotations/train_balanced_questions.json
        val:
          url:
            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/val_balanced_questions.json
          storage:
            - gqa/annotations/val_balanced_questions.json
        test:
          url:
            - https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/test_balanced_questions.json
          storage:
            - gqa/annotations/test_balanced_questions.json
      images:
        storage: gqa/images/
```
**New file: PnP-VQA model config (UnifiedQA-v2 T5-3B reader)**

```yaml
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

model:
  arch: pnp_vqa

  image_question_matching_model:
    arch: blip_image_text_matching
    load_finetuned: True

    finetuned: "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_coco_train2014.pth"

    # vit encoder
    vit_type: "large"
    vit_grad_ckpt: False
    vit_ckpt_layer: 0

    image_size: 384

    # bert config
    med_config_path: "configs/models/med_large_config.json"

    embed_dim: 256

  image_captioning_model:
    arch: blip_caption
    load_finetuned: True

    finetuned: "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption_coco_train2014.pth"

    vit_type: "large"
    vit_grad_ckpt: True
    vit_ckpt_layer: 5

    image_size: 384

    # bert config
    med_config_path: "configs/models/med_large_config.json"

    # generation configs
    prompt: "a picture of "

  question_answering_model:
    arch: pnp_unifiedqav2_fid

    pretrained: "allenai/unifiedqa-v2-t5-3b-1363200"

    t5_config_path: "configs/models/pnp-vqa/unifiedqav2_3b_config.json"

preprocess:
  vis_processor:
    eval:
      name: "blip_image_eval"
      image_size: 384
  text_processor:
    eval:
      name: "blip_caption"
```
**New file: PnP-VQA model config (UnifiedQA-v2 T5-base reader)**

```yaml
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

model:
  arch: pnp_vqa

  image_question_matching_model:
    arch: blip_image_text_matching
    load_finetuned: True

    finetuned: "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_coco_train2014.pth"

    # vit encoder
    vit_type: "large"
    vit_grad_ckpt: False
    vit_ckpt_layer: 0

    image_size: 384

    # bert config
    med_config_path: "configs/models/med_large_config.json"

    embed_dim: 256

  image_captioning_model:
    arch: blip_caption
    load_finetuned: True

    finetuned: "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption_coco_train2014.pth"

    vit_type: "large"
    vit_grad_ckpt: True
    vit_ckpt_layer: 5

    image_size: 384

    # bert config
    med_config_path: "configs/models/med_large_config.json"

    # generation configs
    prompt: "a picture of "

  question_answering_model:
    arch: pnp_unifiedqav2_fid

    pretrained: "allenai/unifiedqa-v2-t5-base-1363200"

    t5_config_path: "configs/models/pnp-vqa/unifiedqav2_base_config.json"

preprocess:
  vis_processor:
    eval:
      name: "blip_image_eval"
      image_size: 384
  text_processor:
    eval:
      name: "blip_caption"
```
**New file: PnP-VQA model config (UnifiedQA-v2 T5-large reader)**

```yaml
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause

model:
  arch: pnp_vqa

  image_question_matching_model:
    arch: blip_image_text_matching
    load_finetuned: True

    finetuned: "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_coco_train2014.pth"

    # vit encoder
    vit_type: "large"
    vit_grad_ckpt: False
    vit_ckpt_layer: 0

    image_size: 384

    # bert config
    med_config_path: "configs/models/med_large_config.json"

    embed_dim: 256

  image_captioning_model:
    arch: blip_caption
    load_finetuned: True

    finetuned: "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption_coco_train2014.pth"

    vit_type: "large"
    vit_grad_ckpt: True
    vit_ckpt_layer: 5

    image_size: 384

    # bert config
    med_config_path: "configs/models/med_large_config.json"

    # generation configs
    prompt: "a picture of "

  question_answering_model:
    arch: pnp_unifiedqav2_fid

    pretrained: "allenai/unifiedqa-v2-t5-large-1363200"

    t5_config_path: "configs/models/pnp-vqa/unifiedqav2_large_config.json"

preprocess:
  vis_processor:
    eval:
      name: "blip_image_eval"
      image_size: 384
  text_processor:
    eval:
      name: "blip_caption"
```