Mengzi-Oscar is built on the English multimodal pre-training model Oscar and initialized with Mengzi-Bert-base. It is trained on 3.7M image-text pairs (0.7M Chinese image-caption pairs and 3M Chinese image-question pairs) drawn from 0.22M images in total.
Pre-training Model Download: Mengzi-Oscar.
Downstream Task Model Download: Chinese Image Caption. Chinese Image-Text Retrieval.
Generated Chinese Caption: 绿油油的草地上有两个面带微笑的人在骑马。
English Version (translated for reference): Two smiling people are riding horses on the green grass.
Generated Chinese Caption: 两个打着伞的人和一个背着孩子的男人走在被水淹没的道路上。
English Version (translated for reference): Two people with umbrellas and a man with a child on his back are walking along the flooded road.
Installation -- Install Oscar via GitHub
Check for installation instructions.
Mengzi-Oscar used 3.7M Chinese Image-text pairs with the following data source distribution:
| Source | VQA (train) | GQA (bal-train) | VG-QA (train) | COCO (train) | Flickr30k (train) |
| --- | --- | --- | --- | --- | --- |
| Image/Text | 83k/545k | 79k/1026k | 87k/931k | 112k/559k | 29k/145k |
Object detection and image feature extraction:
We use the open-source X152-C4 object-attribute detection model as the object detector. The project is hosted at: Scene Graph Benchmark Repo.
Pre-trained X152-C4 model download address.
Features are extracted by the following command:
# pretrained models at
# the associated labelmap at
python tools/ --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 \
MODEL.WEIGHT <path of vinvl_vg_x152c4.pth> \
DATA_DIR <path of image feature> \
OUTPUT_DIR <path to save extracted features>
The object detector outputs English labels. We provide an en-to-zh word dictionary that you can use to convert them to Chinese labels. The pre-training data format, the downstream task data format, and the original English data are all documented in the open-source project Oscar.
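The label conversion described above can be sketched as a simple lookup. This is a minimal illustration, not the project's own conversion script: the function name `translate_labels` and the in-memory dict `en2zh` are hypothetical, and the on-disk format of the released dictionary is not specified here, so loading it is left out. Labels missing from the dictionary are kept in English so that no detection is silently dropped.

```python
def translate_labels(labels, en2zh):
    """Map English object labels to Chinese, keeping unknown labels unchanged.

    `en2zh` is assumed to be a plain dict of English label -> Chinese label.
    """
    return [en2zh.get(label, label) for label in labels]


# Tiny hand-written dictionary, used only for illustration.
en2zh = {"horse": "马", "grass": "草地"}
detected = ["horse", "grass", "umbrella"]
print(translate_labels(detected, en2zh))
```

Falling back to the English label makes gaps in the dictionary easy to spot in the converted data.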
python -m torch.distributed.launch --nproc_per_node=8 oscar/ \
--use_b 1 --max_grad_norm 10.0 \
--gradient_accumulation_steps 1 --use_img_layernorm 1 \
--output_dir <output folder to save the pretrained model> \
--bert_model bert --do_lower_case \
--model_name_or_path <path of mengzi bert base model> \
--learning_rate 1e-04 --warmup_steps 0 --do_train --max_seq_length 35 \
--on_memory --max_img_seq_length 50 --img_feature_dim 2054 --drop_out 0.1 \
--train_batch_size 1024 --ckpt_period 10000 --max_iters 2000000 --log_period 1000 \
--data_dir <path of pretraining data> \
--dataset_file coco_flickr30k_gqa.yaml \
--textb_sample_mode 1 --texta_false_prob 0.25 --num_workers 8
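The value `--img_feature_dim 2054` in the command above is consistent with the layout commonly used by Oscar-style models: a 2048-dimensional region feature concatenated with 6 box-position values. The sketch below illustrates that arithmetic only. The helper names and the exact ordering of the 6 position values are assumptions, not taken from this project's extraction code.

```python
def box_position_features(box, image_width, image_height):
    """Return 6 normalized position values for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [
        x1 / image_width,          # left edge
        y1 / image_height,         # top edge
        x2 / image_width,          # right edge
        y2 / image_height,         # bottom edge
        (x2 - x1) / image_width,   # box width
        (y2 - y1) / image_height,  # box height
    ]


def region_input(region_feature, box, image_width, image_height):
    """Concatenate a region feature with its box-position values."""
    return list(region_feature) + box_position_features(box, image_width, image_height)


# A dummy 2048-d feature, used only to show the resulting dimension.
vec = region_input([0.0] * 2048, (10, 20, 110, 220), 640, 480)
print(len(vec))
```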
For object detection and feature extraction, follow the same procedure as for the pre-training data.
Fine-tune on the COCO image caption dataset (8 RTX 3090 24G GPUs):
python -m torch.distributed.launch --nproc_per_node=8 oscar/ \
--data_dir < path of downloaded coco dataset > \
--model_name_or_path <path of pretrained Mengzi-Oscar model> \
--do_train --do_lower_case --add_od_labels --learning_rate 3e-5 \
--per_gpu_train_batch_size 128 --num_train_epochs 60 --tie_weights --freeze_embedding \
--label_smoothing 0.1 --drop_worst_ratio 0.2 --drop_worst_after 20000 \
--output_dir <path to save the fine-tuned model> --num_workers 8
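The flags `--label_smoothing 0.1`, `--drop_worst_ratio 0.2` and `--drop_worst_after 20000` control the captioning loss. The pure-Python sketch below shows one common reading of such flags: each token's target distribution is smoothed, and after a warm-up number of steps the highest-loss fraction of tokens is left out of the average. It is an illustration under those assumptions, not Oscar's actual loss code, and both function names are hypothetical.

```python
import math


def smoothed_token_loss(logits, target, eps=0.1):
    """Cross-entropy of one token against a label-smoothed target."""
    n_class = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # The gold class gets 1 - eps; eps is spread evenly over the other classes.
    off = eps / (n_class - 1)
    return -sum(((1.0 - eps) if i == target else off) * lp
                for i, lp in enumerate(log_probs))


def batch_loss(token_losses, step, drop_worst_ratio=0.2, drop_worst_after=20000):
    """Average token losses, discarding the largest ones after the warm-up steps."""
    losses = sorted(token_losses)
    if drop_worst_ratio > 0 and step > drop_worst_after:
        keep = max(1, int(len(losses) * (1 - drop_worst_ratio)))
        losses = losses[:keep]
    return sum(losses) / len(losses)


losses = [1.0, 2.0, 3.0, 4.0, 10.0]
print(batch_loss(losses, step=1))      # all five tokens are averaged
print(batch_loss(losses, step=20001))  # the largest loss is dropped first
```

Dropping the worst-scoring tokens is intended to reduce the influence of noisy captions once the model has learned enough for a very high loss to be a useful signal.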
Fine-tune on the AIC-ICC training set and run inference on the validation set (8 RTX 3090 24G GPUs):
python -m torch.distributed.launch --nproc_per_node=8 oscar/ \
--data_dir < path of AIC-ICC dataset > \
--model_name_or_path <path of pretrained model or finetuned coco caption model> \
--do_train --do_lower_case --add_od_labels --learning_rate 3e-5 \
--per_gpu_train_batch_size 128 --num_train_epochs 60 --tie_weights --freeze_embedding \
--label_smoothing 0.1 --drop_worst_ratio 0.2 --drop_worst_after 20000 \
--output_dir <path to save the finetuned model> --save_steps 1000 --logging_steps 1000 \
--evaluate_during_training --num_workers 8 --num_beams 5
Inference on a test dataset:
python -m torch.distributed.launch --nproc_per_node=8 oscar/ \
--data_dir <path of test dataset> \
--do_test --test_yaml test_ch.yaml \
--num_beams 5 --per_gpu_eval_batch_size 128 --max_gen_length 20 \
--eval_model_dir <path of fine-tuned Chinese Image Caption model>
We fine-tune the pre-trained model on the COCO_ir dataset and evaluate on 1K images randomly selected from the AIC-ICC validation set (each image has 5 ground-truth captions).
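Selecting the 1K-image evaluation subset can be sketched as a seeded random sample. This is a minimal illustration: the function name, the `captions_by_image` mapping, and the seed value are hypothetical, and the subset actually used for the reported evaluation is not reproduced by this code.

```python
import random


def sample_eval_subset(captions_by_image, num_images=1000, seed=0):
    """Pick `num_images` images at random and keep all of their captions."""
    # Sort the keys so the sample depends only on the seed, not on dict order.
    image_ids = sorted(captions_by_image)
    rng = random.Random(seed)
    chosen = rng.sample(image_ids, num_images)
    return {image_id: captions_by_image[image_id] for image_id in chosen}


# Dummy validation set: 2000 images with 5 captions each.
validation = {f"img_{i:05d}": [f"caption {j}" for j in range(5)] for i in range(2000)}
subset = sample_eval_subset(validation)
print(len(subset), sum(len(caps) for caps in subset.values()))
```

Fixing the seed keeps the subset identical across runs, so retrieval scores from different checkpoints stay comparable.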
For object detection and feature extraction, follow the same procedure as for the pre-training data.
Fine-tune on the COCO_ir dataset:
python oscar/ --model_name_or_path <path of pretrained model> \
--data_dir <path of coco_ir> \
--img_feat_file <path of pretraining coco features.tsv> \
--do_train --do_lower_case --evaluate_during_training --num_captions_per_img_val 20 \
--eval_caption_index_file --per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 128 --learning_rate 2e-5 --num_train_epochs 30 --weight_decay 0.05 \
--add_od_labels --od_label_type vg --max_seq_length 70 --max_img_seq_length 70 \
--output_dir <path to save model> --save_steps 5000 --logging_steps 500
Evaluation on the AIC-ICC validation 1K dataset:
python mengzi-oscar/ --do_test --do_eval --test_split val \
--num_captions_per_img_val 5 --cross_image_eval --per_gpu_eval_batch_size 1024 \
--eval_model_dir <path of fine-tuned model> --do_lower_case --add_od_labels \
--num_workers 4 --img_feat_file < path of AIC-ICC > \
--data_dir <path of AIC-ICC-ir> --eval_img_keys_file