Mengzi-Oscar is built on the English multimodal pre-training model Oscar and initialized with Mengzi-Bert-base. It is pre-trained on 3.7M image-text pairs, consisting of 0.7M Chinese image-caption pairs and 3M Chinese image-question pairs, covering a total of 0.22M images.
Pre-training Model Download: Mengzi-Oscar.
Downstream Task Model Download: Chinese Image Caption, Chinese Image-Text Retrieval.
Generated Chinese Caption:绿油油的草地上有两个面带微笑的人在骑马。
English Version (translated for reference): Two smiling people are riding horses on the green grass.
Generated Chinese Caption:两个打着伞的人和一个背着孩子的男人走在被水淹没的道路上。
English Version (translated for reference): Two people holding umbrellas and a man carrying a child on his back are walking along a flooded road.
Installation -- Install Oscar via GitHub
Check INSTALL.md for installation instructions.
Mengzi-Oscar was pre-trained on 3.7M Chinese image-text pairs with the following data source distribution:
| Source | VQA (train) | GQA (bal-train) | VG-QA (train) | COCO (train) | Flickr30k (train) |
|---|---|---|---|---|---|
| Image/Text | 83k/545k | 79k/1026k | 87k/931k | 112k/559k | 29k/145k |
Image object detection and feature extraction:
We use the open-source X152-C4 object-attribute detector as the object detection tool; the project is available at the Scene Graph Benchmark Repo.
Pre-trained X152-C4 model download address.
Features are extracted with the following command:
# pretrained models at https://penzhanwu2.blob.core.windows.net/sgg/sgg_benchmark/vinvl_model_zoo/vinvl_vg_x152c4.pth
# the associated labelmap at https://penzhanwu2.blob.core.windows.net/sgg/sgg_benchmark/vinvl_model_zoo/VG-SGG-dicts-vgoi6-clipped.json
python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 \
MODEL.WEIGHT <path of vinvl_vg_x152c4.pth> \
MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 \
DATA_DIR <path of input images> \
OUTPUT_DIR <path to save extracted features> \
TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True
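Each region fed to the model is represented by a 2054-dimensional vector (see --img_feature_dim 2054 in the pre-training command below): the 2048-d detector feature concatenated with 6 box-geometry values. A minimal sketch of that composition, assuming the common Oscar/VinVL layout of normalized corner coordinates plus relative width and height (the exact ordering may differ in the official extraction code):

```python
import numpy as np

def compose_region_feature(feat_2048, box, img_w, img_h):
    # Concatenate the 2048-d detector feature with 6 normalized box values;
    # a sketch of the usual 2054-d Oscar/VinVL input, not the official code.
    x1, y1, x2, y2 = box
    geo = np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                    (x2 - x1) / img_w, (y2 - y1) / img_h], dtype=np.float32)
    return np.concatenate([feat_2048, geo])  # shape: (2054,)
```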
For the English labels produced by object detection, we provide an en-to-zh word dictionary that can be used to convert the English labels into Chinese. The pre-training data format, the downstream task data format, and the original English data are described in the open-source Oscar project's VinVL_DOWNLOAD.md.
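As a minimal sketch of this label conversion, assuming the dictionary is shipped as a JSON file mapping English labels to Chinese (the file name en2zh_dict.json below is hypothetical):

```python
import json

# hypothetical file name; use the en-to-zh dictionary provided with Mengzi-Oscar
with open("en2zh_dict.json", encoding="utf-8") as f:
    en2zh = json.load(f)  # e.g. {"horse": "马", "grass": "草地", ...}

def translate_labels(en_labels):
    # fall back to the English label when a word is missing from the dictionary
    return [en2zh.get(label, label) for label in en_labels]

print(translate_labels(["horse", "grass"]))  # -> ['马', '草地']
```

With features extracted and labels converted to Chinese, pre-training is launched with the following command: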
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_oscarplus_pretrain.py \
--use_b 1 --max_grad_norm 10.0 \
--gradient_accumulation_steps 1 --use_img_layernorm 1 \
--output_dir <output folder to save the pretrained model> \
--bert_model bert --do_lower_case \
--model_name_or_path <path of mengzi bert base model> \
--learning_rate 1e-04 --warmup_steps 0 --do_train --max_seq_length 35 \
--on_memory --max_img_seq_length 50 --img_feature_dim 2054 --drop_out 0.1 \
--train_batch_size 1024 --ckpt_period 10000 --max_iters 2000000 --log_period 1000 \
--data_dir <path of pretraining data> \
--dataset_file coco_flickr30k_gqa.yaml \
--textb_sample_mode 1 --texta_false_prob 0.25 --num_workers 8
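For reference, --texta_false_prob 0.25 drives the contrastive part of pre-training: with probability 0.25 the caption (text-a) is replaced by a random one, and the model is trained to predict whether the pair has been polluted. A simplified sketch of that corruption step (function and variable names are illustrative, not the actual Oscar code):

```python
import random

def corrupt_texta(caption, tags, caption_pool, false_prob=0.25):
    # With probability false_prob, swap in a random caption; the returned
    # label (1 = polluted, 0 = original) supervises the contrastive head.
    if random.random() < false_prob:
        return random.choice(caption_pool), tags, 1
    return caption, tags, 0
```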
Object detection and feature extraction follow the same procedure as for the pre-training data (see above).
fine-tune on the COCO image caption dataset (8 RTX 3090 24G):
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_captioning.py \
--data_dir <path of downloaded coco dataset> \
--model_name_or_path <path of pretrained Mengzi-Oscar model> \
--do_train --do_lower_case --add_od_labels --learning_rate 3e-5 \
--per_gpu_train_batch_size 128 --num_train_epochs 60 --tie_weights --freeze_embedding \
--label_smoothing 0.1 --drop_worst_ratio 0.2 --drop_worst_after 20000 \
--output_dir <path to save the fine-tuned model> --num_workers 8
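Here --drop_worst_ratio 0.2 with --drop_worst_after 20000 means that after 20000 steps the 20% of tokens with the highest loss in each batch are discarded, a simple guard against noisy captions. A rough sketch of that part of the loss, assuming a flat tensor of per-token cross-entropy values:

```python
import torch

def drop_worst_loss(token_losses, step, drop_ratio=0.2, drop_after=20000):
    # Keep every token early in training; later, average only the
    # (1 - drop_ratio) fraction of tokens with the smallest loss.
    if step <= drop_after:
        return token_losses.mean()
    k = max(1, int(token_losses.numel() * (1 - drop_ratio)))
    kept, _ = torch.topk(token_losses.view(-1), k, largest=False)
    return kept.mean()
```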
fine-tune on the AIC-ICC train set and run inference on the validation set (8 RTX 3090 24G):
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_captioning.py \
--data_dir <path of AIC-ICC dataset> \
--model_name_or_path <path of pretrained model or finetuned coco caption model> \
--do_train --do_lower_case --add_od_labels --learning_rate 3e-5 \
--per_gpu_train_batch_size 128 --num_train_epochs 60 --tie_weights --freeze_embedding \
--label_smoothing 0.1 --drop_worst_ratio 0.2 --drop_worst_after 20000 \
--output_dir <path to save the finetuned model> --save_steps 1000 --logging_steps 1000 \
--evaluate_during_training --num_workers 8 --num_beams 5
inference on the test dataset:
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_captioning.py \
--data_dir <path of test dataset> \
--do_test --test_yaml test_ch.yaml \
--num_beams 5 --per_gpu_eval_batch_size 128 --max_gen_length 20 \
--eval_model_dir <path of fine-tuned Chinese Image Caption model>
We fine-tune the pre-trained model on the COCO_ir dataset and randomly select 1K images from the AIC-ICC validation set (each image has 5 ground-truth captions) for evaluation.
Object detection and feature extraction follow the same procedure as for the pre-training data (see above).
fine-tune on COCO_ir dataset:
python oscar/run_retrieval.py --model_name_or_path <path of pretrained model> \
--data_dir <path of coco_ir> \
--img_feat_file <path of pretraining coco features.tsv> \
--do_train --do_lower_case --evaluate_during_training --num_captions_per_img_val 20 \
--eval_caption_index_file minival_caption_indexs_top20.pt --per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 128 --learning_rate 2e-5 --num_train_epochs 30 --weight_decay 0.05 \
--add_od_labels --od_label_type vg --max_seq_length 70 --max_img_seq_length 70 \
--output_dir <path to save model> --save_steps 5000 --logging_steps 500
evaluation on the AIC-ICC validation 1k dataset:
python mengzi-oscar/run_retrieval.py --do_test --do_eval --test_split val \
--num_captions_per_img_val 5 --cross_image_eval --per_gpu_eval_batch_size 1024 \
--eval_model_dir <path of fine-tuned model> --do_lower_case --add_od_labels \
--num_workers 4 --img_feat_file <path of AIC-ICC val.feature.pt> \
--data_dir <path of AIC-ICC-ir> --eval_img_keys_file val_img_keys_1k.pt
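For reference, --cross_image_eval scores every caption against every image in the 1K set, and retrieval quality is reported as Recall@K over the ranked scores. A minimal sketch of computing text-to-image Recall@K from such a score matrix (shapes and variable names are illustrative):

```python
import numpy as np

def recall_at_k(scores, gt_image_ids, k=1):
    # scores: (num_captions, num_images) matching scores;
    # gt_image_ids: index of the ground-truth image for each caption.
    ranks = np.argsort(-scores, axis=1)  # best-scoring images first
    hits = [gt in ranks[i, :k] for i, gt in enumerate(gt_image_ids)]
    return float(np.mean(hits))

# With 5 captions per image, gt_image_ids repeats each image index 5 times.
```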